Difficulty adding up elements in a pandas DataFrame - python
I'm currently having trouble adding up the rows of the following DataFrame, which I have constructed from the returns of six companies' stocks:
import pandas as pd

def importdata(data):
    returns = pd.read_excel(data)                    # Imports the data from Excel
    returns_with_dates = returns.set_index('Dates')  # Sets the Dates as the df index
    return returns_with_dates
which outputs:
Out[345]:
Company 1 Company 2 Company 3 Company 4 Company 5 Company 6
Dates
1997-01-02 31.087620 3.094705 24.058686 31.694404 37.162890 13.462241
1997-01-03 31.896592 3.109631 22.423629 32.064378 37.537013 13.511706
1997-01-06 31.723241 3.184358 18.803148 32.681000 37.038183 13.684925
1997-01-07 31.781024 3.199380 19.503886 33.544272 37.038183 13.660193
1997-01-08 31.607673 3.169431 19.387096 32.927650 37.537013 13.585995
1997-01-09 31.492106 3.199380 19.737465 33.420948 37.038183 13.759214
1997-01-10 32.589996 3.184358 19.270307 34.284219 37.661721 13.858235
1997-01-13 32.416645 3.199380 19.153517 35.147491 38.035844 13.660193
1997-01-14 32.301077 3.184358 19.503886 35.517465 39.407629 13.783946
1997-01-15 32.127726 3.199380 19.387096 35.887438 38.409967 13.759214
1997-01-16 32.532212 3.229232 19.737465 36.257412 39.282921 13.635460
1997-01-17 33.167833 3.259180 20.087835 37.490657 39.033505 13.858235
1997-01-20 33.456751 3.229232 20.438204 35.640789 39.657044 14.377892
1997-01-21 33.225616 3.244158 20.671783 36.010763 40.779413 14.179940
1997-01-22 33.110049 3.289033 21.489312 36.010763 40.654705 14.254138
1997-01-23 32.705563 3.199380 20.905363 35.394140 40.904121 14.229405
1997-01-24 32.127726 3.139579 20.204624 35.764114 40.405290 13.957165
1997-01-27 32.127726 3.094705 20.204624 35.270816 40.779413 13.882968
1997-01-28 31.781024 3.079778 20.788573 34.407544 41.153536 13.684925
1997-01-29 32.185510 3.094705 21.138942 34.654193 41.278244 13.858235
1997-01-30 32.647779 3.094705 21.022153 34.407544 41.652367 13.981898
1997-01-31 32.532212 3.064757 20.204624 34.037570 42.275905 13.858235
For countless hours I have tried to sum the rows in blocks: 1997-01-02 to 1997-01-08, then 1997-01-09 to 1997-01-15, and so on, i.e. the first five rows, then the following five rows. Furthermore, I want to keep the date of the fifth row as the index of each summed block, so for the block from 1997-01-02 to 1997-01-08 the summed row should be indexed by 1997-01-08. It is worth mentioning that five rows is just an example; ideally I want to sum every n rows, then the following n rows, keeping the date in the same way as before. I have figured out a way, shown in the code below, of doing it in array form, but then I lose the dates.
def nday_log_returns(data, n, asset_number):
    returns = pd.read_excel(data)                    # Imports the data from Excel
    returns_with_dates = returns.set_index('Dates')  # Sets the Dates as the df index
    returns_mat = returns_with_dates.as_matrix()     # (.to_numpy() in newer pandas)
    ndays = int(len(returns_mat) / n)                # Number of n-day blocks in our time period
    # Empty array to accommodate the n-day log-returns
    nday_returns = np.empty((ndays, min(np.shape(returns_mat))))
    for i in range(1, asset_number + 1):
        for j in range(1, ndays + 1):
            nday_returns[j - 1, i - 1] = np.sum(returns_mat[(n * j) - n:n * j, i - 1])
    return nday_returns
Is there any way of doing this in a DataFrame context while keeping the dates in the way I described? I've been trying to do this for so long without any kind of success and it's really stressing me out! For some reason everyone finds pandas extremely useful and easy to use, but I happen to find it the opposite. Any kind of help would be very much appreciated. Thanks in advance.
Use groupby:
df.groupby(np.arange(len(df)) // 5).sum()
To include the date index as requested:
g = np.arange(len(df)) // 5
i = df.index.to_series().groupby(g).last()
df.groupby(g).sum().set_index(i)
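For convenience, here is a minimal sketch (my addition, assembled from the two snippets above) that wraps the idea into a function taking an arbitrary block size n:

import numpy as np
import pandas as pd

def nday_sum(df, n=5):
    """Sum every n consecutive rows, labelling each block with its last date."""
    g = np.arange(len(df)) // n                   # block number for each row
    i = df.index.to_series().groupby(g).last()    # last date in each block
    return df.groupby(g).sum().set_index(i)

# e.g. nday_sum(returns_with_dates, n=5)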
If the dates are missing in a consistent pattern (here the weekends, so every five trading days span seven calendar days), you can resample by the number of calendar days you want. Using resample keeps the dates in the index. You can also use the loffset parameter to shift the dates.
df.resample('7D', loffset='6D').sum()
             Company 1  Company 2   Company 3   Company 4   Company 5  Company 6
Dates
1997-01-08  158.096150  15.757505  104.176445  162.911704  186.313282  67.905060
1997-01-15  160.927550  15.966856   97.052271  174.257561  190.553344  68.820802
1997-01-22  165.492461  16.250835  102.424599  181.410384  199.407588  70.305665
1997-01-29  160.927549  15.608147  103.242126  175.490807  204.520604  69.612698
1997-02-05   65.179991   6.159462   41.226777   68.445114   83.928272  27.840133
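One caveat worth adding (my note, not part of the original answer): the loffset argument of resample was deprecated and later removed in newer pandas releases. A rough equivalent is to shift the resulting index yourself:

import pandas as pd

out = df.resample('7D').sum()
out.index = out.index + pd.Timedelta(days=6)  # same effect as loffset='6D'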
Related
Filling and renaming a dataset at the same time
I would like to fill a dataset and compute the log returns at the same time. These are the returns:

ret_names = ['FTSEMIB_Index_ret', 'FCA_IM_Equity_ret', 'UCG_IM_Equity_ret', 'ISP_IM_Equity_ret', 'ENI_IM_Equity_ret', 'LUX_IM_Equity_ret']

and this is the DataFrame:

   FTSEMIB_Index  FCA_IM_Equity  UCG_IM_Equity  ISP_IM_Equity  ENI_IM_Equity  LUX_IM_Equity
0       22793.69         14.840         16.430         2.8860         14.040          49.24
1       22991.99         15.150         16.460         2.8780         14.220          48.98
2       23046.05         15.290         16.760         2.8660         14.300          48.70
3       23014.13         15.660         16.390         2.8500         14.380          48.72
4       23002.85         15.590         16.300         2.8420         14.500          49.13

So my idea is to use enumerate in a for loop:

for index, name in enumerate(ret_names):
    df[name] = np.diff(np.log(df.iloc[:, index]))

but the lengths don't match, because taking the returns drops one value (the first one, I suppose). Any idea?
Maybe I found a solution, but I can't figure out why the previous one doesn't work:

for index, name in enumerate(ret_names):
    df[name] = np.log(df.iloc[:, index]) / np.log(df.iloc[:, index]).shift(1)

With this you can fill and assign by name; the first value just becomes NaN from the shift, so the lengths match.
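A side note (my addition, not part of the original exchange): dividing the logs gives log(p_t) / log(p_(t-1)), which is not the usual log return. The conventional definition is log(p_t / p_(t-1)), which also keeps the lengths aligned because the first row simply becomes NaN:

import numpy as np

# Conventional log returns; the first row of each column becomes NaN.
for index, name in enumerate(ret_names):
    df[name] = np.log(df.iloc[:, index] / df.iloc[:, index].shift(1))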
Filling in missing values based on values in both preceding and succeeding rows
I have a dataset analogous to the one below where for a website I have the number of views every month for two years (2001-2002). However, due to the way the data was gathered, I only have information for a website if it had > 0 views. So, I am trying to fill in the number of views for months where that is not the case: i.e., cases where the website was online but had no views. Unfortunately, I have no information for when the website was first published, so I assume that it was introduced the first time there are non-zero values for a month. I also assume the website was taken down if there are consecutive months with np.nan values at the end of 2002. So, currently, the Views column has np.nan values for both months where views are zero, and the website was simply not online. I want to make sure that months with zero views have 0 in the Views column, such that the below data frame,

Website,Month,Year,Views
1,January,2001,
1,February,2001,
1,March,2001,3.0
1,April,2001,4.0
1,May,2001,23.0
1,June,2001,
1,July,2001,5.0
1,August,2001,4.0
1,September,2001,3.0
1,October,2001,3.0
1,November,2001,3.0
1,December,2001,35.0
1,January,2002,6.0
1,February,2002,
1,March,2002,3.0
1,April,2002,
1,May,2002,
1,June,2002,3.0
1,July,2002,3.0
1,August,2002,2.0
1,September,2002,
1,October,2002,
1,November,2002,
1,December,2002,
2,January,2001,3.0
2,February,2001,1.0
2,March,2001,2.0
2,April,2001,2.0
2,May,2001,22.0
2,June,2001,
2,July,2001,4.0
2,August,2001,3.0
2,September,2001,3.0
2,October,2001,4.0
2,November,2001,
2,December,2001,1.0
2,January,2002,
2,February,2002,4.0
2,March,2002,2.0
2,April,2002,5.0
2,May,2002,2.0
2,June,2002,
2,July,2002,2.0
2,August,2002,3.0
2,September,2002,
2,October,2002,
2,November,2002,2.0
2,December,2002,5.0

looks like this:

Website,Month,Year,Views
1,January,2001,
1,February,2001,
1,March,2001,3.0
1,April,2001,4.0
1,May,2001,23.0
1,June,2001,0.0
1,July,2001,5.0
1,August,2001,4.0
1,September,2001,3.0
1,October,2001,3.0
1,November,2001,3.0
1,December,2001,35.0
1,January,2002,6.0
1,February,2002,0.0
1,March,2002,3.0
1,April,2002,0.0
1,May,2002,0.0
1,June,2002,3.0
1,July,2002,3.0
1,August,2002,2.0
1,September,2002,
1,October,2002,
1,November,2002,
1,December,2002,
2,January,2001,3.0
2,February,2001,1.0
2,March,2001,2.0
2,April,2001,2.0
2,May,2001,22.0
2,June,2001,0.0
2,July,2001,4.0
2,August,2001,3.0
2,September,2001,3.0
2,October,2001,4.0
2,November,2001,0.0
2,December,2001,1.0
2,January,2002,0.0
2,February,2002,4.0
2,March,2002,2.0
2,April,2002,5.0
2,May,2002,2.0
2,June,2002,0.0
2,July,2002,2.0
2,August,2002,3.0
2,September,2002,0.0
2,October,2002,0.0
2,November,2002,2.0
2,December,2002,5.0

In other words, if all preceding months for that website show np.nan values, and the current value is np.nan, it should remain that way. Similarly, if all following months show np.nan, the column should remain np.nan as well. However, if at least one preceding month is not np.nan the value should change to 0, etc. The tricky part is that my dataset has about 4,000,000 rows, and I need a fairly efficient way to do this. Does anyone have any suggestions?
Here's my approach:

# s counts the non-null views so far
s = df['Views'].notnull().groupby(df['Website']).cumsum()

# fill the null only where s > 0
df['Views'] = np.where(df['Views'].isna() & s.gt(0), 0, df['Views'])

# equivalent
# df.loc[df['Views'].isna() & s.gt(0), 'Views'] = 0
I followed Quang Hoang's response and used the code below, which worked perfectly:

# Same as Quang Hoang's answer:
s = df['Views'].notnull().groupby(df['Website']).cumsum()

# Count the non-null views so far, but starting with the last observations
b = df['Views'].notnull()[::-1].groupby(df['Website']).cumsum()

# fill the null only where s > 0 and b > 0
df['Views'] = np.where(df['Views'].isna() & (s.gt(0) & b.gt(0)), 0, df['Views'])
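For illustration, a tiny self-contained demo on made-up data (my addition) showing how the forward and backward counts leave leading and trailing NaNs alone while filling the interior gap with 0:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Website': [1, 1, 1, 1, 1, 1],
                   'Views':   [np.nan, np.nan, 3.0, np.nan, 2.0, np.nan]})
s = df['Views'].notnull().groupby(df['Website']).cumsum()
b = df['Views'].notnull()[::-1].groupby(df['Website']).cumsum()
df['Views'] = np.where(df['Views'].isna() & (s.gt(0) & b.gt(0)), 0, df['Views'])
print(df['Views'].tolist())  # [nan, nan, 3.0, 0.0, 2.0, nan]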
Convert Quandl FX hourly data to daily
I would like to convert hourly financial data, imported into a pandas DataFrame with the following CSV header, to daily data:

symbol,date,hour,openbid,highbid,lowbid,closebid,openask,highask,lowask,closeask,totalticks

I've imported the data with pandas.read_csv(). I have eliminated all but one symbol from the data for testing purposes, and have figured out this part so far:

df.groupby('date').agg({'highask': [max], 'lowask': [min]})

I'm still pretty new to Python, so I'm not really sure how to continue. I'm guessing I can use some kind of anonymous function to create additional fields. For example, I'd like to get the open ask price for each date at hour 0, and the close ask price for each date at hour 23. Ideally, I would add additional columns and create a new DataFrame. I also want to add a new column for market price, which is an average of the ask/bid low, high, open, and close. Any advice would be greatly appreciated. Thanks!

Edit: As requested, here is the output I would expect for just 2018-07-24:

symbol,date,openbid,highbid,lowbid,closebid,openask,highask,lowask,closeask,totalticks
AUD/USD,2018-07-24,0.7422,0.74297,0.7429,0.74196,0.74257,0.743,0.74197,0.74258,5191

openbid is the openbid at the lowest hour for a single date, closebid is the closebid at the highest hour for a single date, etc. totalticks is the sum. What I am really struggling with is determining openbid, openask, closebid, and closeask.

Sample data:

symbol,date,hour,openbid,highbid,lowbid,closebid,openask,highask,lowask,closeask,totalticks
AUD/USD,2018-07-24,22,0.7422,0.74249,0.74196,0.7423,0.74225,0.74252,0.74197,0.74234,1470
AUD/USD,2018-07-24,23,0.7423,0.74297,0.7423,0.74257,0.74234,0.743,0.74234,0.74258,3721
AUD/USD,2018-07-25,0,0.74257,0.74334,0.74237,0.74288,0.74258,0.74335,0.74239,0.74291,7443
AUD/USD,2018-07-25,1,0.74288,0.74492,0.74105,0.74111,0.74291,0.74501,0.74107,0.74111,14691
AUD/USD,2018-07-25,2,0.74111,0.74127,0.74015,0.74073,0.74111,0.74129,0.74018,0.74076,6898
AUD/USD,2018-07-25,3,0.74073,0.74076,0.73921,0.73987,0.74076,0.74077,0.73923,0.73989,6207
AUD/USD,2018-07-25,4,0.73987,0.74002,0.73921,0.73953,0.73989,0.74003,0.73923,0.73956,3453
AUD/USD,2018-07-25,5,0.73953,0.74094,0.73946,0.74041,0.73956,0.74096,0.73947,0.74042,7187
AUD/USD,2018-07-25,6,0.74041,0.74071,0.73921,0.74056,0.74042,0.74069,0.73922,0.74059,10646
AUD/USD,2018-07-25,7,0.74056,0.74066,0.73973,0.74035,0.74059,0.74068,0.73974,0.74037,9285
AUD/USD,2018-07-25,8,0.74035,0.74206,0.73996,0.74198,0.74037,0.74207,0.73998,0.742,10234
AUD/USD,2018-07-25,9,0.74198,0.74274,0.74176,0.74225,0.742,0.74275,0.74179,0.74227,8224
AUD/USD,2018-07-25,10,0.74225,0.74237,0.74122,0.74142,0.74227,0.74237,0.74124,0.74143,7143
AUD/USD,2018-07-25,11,0.74142,0.74176,0.74093,0.74152,0.74143,0.74176,0.74095,0.74152,7307
AUD/USD,2018-07-25,12,0.74152,0.74229,0.74078,0.74219,0.74152,0.74229,0.74079,0.74222,10523
AUD/USD,2018-07-25,13,0.74219,0.74329,0.74138,0.74141,0.74222,0.74332,0.74136,0.74145,13983
AUD/USD,2018-07-25,14,0.74141,0.74217,0.74032,0.74065,0.74145,0.7422,0.74034,0.74067,21814
AUD/USD,2018-07-25,15,0.74065,0.74151,0.73989,0.74113,0.74067,0.74152,0.73988,0.74115,16085
AUD/USD,2018-07-25,16,0.74113,0.74144,0.74056,0.7411,0.74115,0.74146,0.74058,0.74111,7752
AUD/USD,2018-07-25,17,0.7411,0.7435,0.74092,0.74346,0.74111,0.74353,0.74094,0.74348,11348
AUD/USD,2018-07-25,18,0.74346,0.74445,0.74331,0.74373,0.74348,0.74446,0.74333,0.74373,9898
AUD/USD,2018-07-25,19,0.74373,0.74643,0.74355,0.74559,0.74373,0.74643,0.74358,0.7456,11756
AUD/USD,2018-07-25,20,0.74559,0.74596,0.74478,0.74549,0.7456,0.746,0.74481,0.74562,5607
AUD/USD,2018-07-25,21,0.74549,0.74562,0.74417,0.74438,0.74562,0.74576,0.74422,0.74442,3613
AUD/USD,2018-07-26,22,0.73762,0.73792,0.73762,0.73774,0.73772,0.73798,0.73768,0.73779,1394
AUD/USD,2018-07-26,23,0.73774,0.73813,0.73744,0.73807,0.73779,0.73816,0.73746,0.73808,3465
AUD/USD,2018-07-27,0,0.73807,0.73826,0.73733,0.73763,0.73808,0.73828,0.73735,0.73764,6582
AUD/USD,2018-07-27,1,0.73763,0.73854,0.73734,0.73789,0.73764,0.73857,0.73736,0.73788,7373
AUD/USD,2018-07-27,2,0.73789,0.73881,0.73776,0.73881,0.73788,0.73883,0.73778,0.73882,3414
AUD/USD,2018-07-27,3,0.73881,0.7393,0.73849,0.73875,0.73882,0.73932,0.73851,0.73877,4639
AUD/USD,2018-07-27,4,0.73875,0.739,0.73852,0.73858,0.73877,0.73901,0.73852,0.73859,2487
AUD/USD,2018-07-27,5,0.73858,0.73896,0.7381,0.73887,0.73859,0.73896,0.73812,0.73888,5332
AUD/USD,2018-07-27,6,0.73887,0.73902,0.73792,0.73879,0.73888,0.73902,0.73793,0.73881,7623
AUD/USD,2018-07-27,7,0.73879,0.7395,0.73844,0.73885,0.73881,0.7395,0.73846,0.73887,9577
AUD/USD,2018-07-27,8,0.73885,0.73897,0.73701,0.73727,0.73887,0.73899,0.73702,0.73729,12280
AUD/USD,2018-07-27,9,0.73727,0.73784,0.737,0.73721,0.73729,0.73786,0.73701,0.73723,8634
AUD/USD,2018-07-27,10,0.73721,0.73798,0.73717,0.73777,0.73723,0.73798,0.73718,0.73779,7510
AUD/USD,2018-07-27,11,0.73777,0.73789,0.73728,0.73746,0.73779,0.73789,0.7373,0.73745,4947
AUD/USD,2018-07-27,12,0.73746,0.73927,0.73728,0.73888,0.73745,0.73929,0.73729,0.73891,16853
AUD/USD,2018-07-27,13,0.73888,0.74083,0.73853,0.74066,0.73891,0.74083,0.73855,0.74075,14412
AUD/USD,2018-07-27,14,0.74066,0.74147,0.74025,0.74062,0.74075,0.74148,0.74026,0.74064,15187
AUD/USD,2018-07-27,15,0.74062,0.74112,0.74002,0.74084,0.74064,0.74114,0.74003,0.74086,10044
AUD/USD,2018-07-27,16,0.74084,0.74091,0.73999,0.74001,0.74086,0.74092,0.74,0.74003,6893
AUD/USD,2018-07-27,17,0.74001,0.74022,0.73951,0.74008,0.74003,0.74025,0.73952,0.74009,5865
AUD/USD,2018-07-27,18,0.74008,0.74061,0.74002,0.74046,0.74009,0.74062,0.74004,0.74047,4334
AUD/USD,2018-07-27,19,0.74046,0.74072,0.74039,0.74041,0.74047,0.74073,0.74041,0.74043,3654
AUD/USD,2018-07-27,20,0.74041,0.74066,0.74005,0.74011,0.74043,0.74068,0.74018,0.74023,1547
AUD/USD,2018-07-25,22,0.74438,0.74526,0.74436,0.74489,0.74442,0.7453,0.74439,0.74494,2220
AUD/USD,2018-07-25,23,0.74489,0.74612,0.74489,0.7459,0.74494,0.74612,0.74492,0.74592,4886
AUD/USD,2018-07-26,0,0.7459,0.74625,0.74536,0.74571,0.74592,0.74623,0.74536,0.74573,6602
AUD/USD,2018-07-26,1,0.74571,0.74633,0.74472,0.74479,0.74573,0.74634,0.74471,0.74481,10123
AUD/USD,2018-07-26,2,0.74479,0.74485,0.74375,0.74434,0.74481,0.74487,0.74378,0.74437,7844
AUD/USD,2018-07-26,3,0.74434,0.74459,0.74324,0.744,0.74437,0.74461,0.74328,0.744,6037
AUD/USD,2018-07-26,4,0.744,0.74428,0.74378,0.74411,0.744,0.7443,0.74379,0.74414,3757
AUD/USD,2018-07-26,5,0.74411,0.74412,0.74346,0.74349,0.74414,0.74414,0.74344,0.74349,5713
AUD/USD,2018-07-26,6,0.74349,0.74462,0.74291,0.74299,0.74349,0.74464,0.74293,0.743,12650
AUD/USD,2018-07-26,7,0.74299,0.74363,0.74267,0.74361,0.743,0.74363,0.74269,0.74362,8067
AUD/USD,2018-07-26,8,0.74361,0.74375,0.74279,0.74287,0.74362,0.74376,0.7428,0.74288,6988
AUD/USD,2018-07-26,9,0.74287,0.74322,0.74212,0.74318,0.74288,0.74323,0.74212,0.74319,7784
AUD/USD,2018-07-26,10,0.74318,0.74329,0.74249,0.74276,0.74319,0.74331,0.7425,0.74276,5271
AUD/USD,2018-07-26,11,0.74276,0.74301,0.74179,0.74201,0.74276,0.74303,0.7418,0.74199,7434
AUD/USD,2018-07-26,12,0.74201,0.74239,0.74061,0.74064,0.74199,0.74241,0.74063,0.74066,20513
AUD/USD,2018-07-26,13,0.74064,0.74124,0.73942,0.74008,0.74066,0.74124,0.73943,0.74005,19715
AUD/USD,2018-07-26,14,0.74008,0.74014,0.73762,0.73887,0.74005,0.74013,0.73764,0.73889,21137
AUD/USD,2018-07-26,15,0.73887,0.73936,0.73823,0.73831,0.73889,0.73936,0.73824,0.73833,11186
AUD/USD,2018-07-26,16,0.73831,0.73915,0.73816,0.73908,0.73833,0.73916,0.73817,0.73908,6016
AUD/USD,2018-07-26,17,0.73908,0.73914,0.73821,0.73884,0.73908,0.73917,0.73823,0.73887,6197
AUD/USD,2018-07-26,18,0.73884,0.73885,0.73737,0.73773,0.73887,0.73887,0.73737,0.73775,6127
AUD/USD,2018-07-26,19,0.73773,0.73794,0.73721,0.73748,0.73775,0.73797,0.73724,0.73751,3614
AUD/USD,2018-07-26,20,0.73748,0.73787,0.73746,0.73767,0.73751,0.7379,0.73748,0.73773,1801
AUD/USD,2018-07-26,21,0.73767,0.73807,0.73755,0.73762,0.73773,0.73836,0.73769,0.73772,1687
To assign a new column avg_market_price as the average:

df = df.assign(avg_market_price=df[['openbid', 'highbid', 'lowbid', 'closebid',
                                    'openask', 'highask', 'lowask', 'closeask']].mean(axis=1))

You then want to set the index to a datetime index by combining the date and hour fields, then resample your data to daily time periods (1d). Finally, use apply to get the max, min and average values on specific columns.

import numpy as np

# assumes df['date'] is already datetime (e.g. parse_dates in read_csv)
>>> (df
     .set_index(df['date'] + pd.to_timedelta(df['hour'], unit='h'))
     .resample('1d')
     .apply({'highask': 'max', 'lowask': 'min', 'avg_market_price': np.mean}))

            highask   lowask  avg_market_price
2018-07-24  0.74300  0.74197          0.742402
2018-07-25  0.74643  0.73922          0.742142
2018-07-26  0.74634  0.73724          0.741239
2018-07-27  0.74148  0.73701          0.739011
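For the open/close fields the question was struggling with, one option (a sketch of my own, not from the original answer; it assumes the rows are sorted by date and hour and a pandas version recent enough for named aggregation) is to take 'first' and 'last' per date:

# df as read with pandas.read_csv above
daily = (df.sort_values(['date', 'hour'])
           .groupby('date')
           .agg(openbid=('openbid', 'first'),
                highbid=('highbid', 'max'),
                lowbid=('lowbid', 'min'),
                closebid=('closebid', 'last'),
                openask=('openask', 'first'),
                highask=('highask', 'max'),
                lowask=('lowask', 'min'),
                closeask=('closeask', 'last'),
                totalticks=('totalticks', 'sum')))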
Transform raw date format into a pandas date object
I have a CSV file which looks like this:

time, Numbers
[30/Apr/1998:21:30:17,24736
[30/Apr/1998:21:30:53,24736
[30/Apr/1998:21:31:12,24736
[30/Apr/1998:21:31:19,3781
[30/Apr/1998:21:31:22,-
[30/Apr/1998:21:31:27,24736
[30/Apr/1998:21:31:29,-
[30/Apr/1998:21:31:29,-
[30/Apr/1998:21:31:32,929
[30/Apr/1998:21:31:43,-
[30/Apr/1998:21:31:44,1139
[30/Apr/1998:21:31:52,24736
[30/Apr/1998:21:31:52,3029
[30/Apr/1998:21:32:06,24736
[30/Apr/1998:21:32:16,-
[30/Apr/1998:21:32:16,-
[30/Apr/1998:21:32:17,-
[30/Apr/1998:21:32:30,14521
[30/Apr/1998:21:32:33,11324
[30/Apr/1998:21:32:35,24736
[30/Apr/1998:21:32:3l8,671
[30/Apr/1998:21:32:38,1512
[30/Apr/1998:21:32:38,1136
[30/Apr/1998:21:32:38,1647
[30/Apr/1998:21:32:38,1271
[30/Apr/1998:21:32:52,5933
[30/Apr/1998:21:32:58,-
[30/Apr/1998:21:32:59,231

and it goes on up to about one billion rows (ignore the Numbers column). I want to convert this date-time format in my CSV file to a pandas timestamp, so I can plot my dataset and visualize it over time. As I am new to data science, here is my approach:

Step 1: take the whole time column from my CSV file into an array.
Step 2: split the data at the colon into two new arrays, date and time.
Step 3: remove "[" from the date array.
Step 4: replace all forward slashes with dashes in the date array.
Step 5: append the date and time arrays to make a single pandas-style timestamp, which will look like this: 2017-03-22 15:16:45.

As you can tell, I am new and my approach is naive and probably wrong. If someone can help me with a code snippet, I will be really happy. Thanks.
You can pass a format to pd.to_datetime(), in this case: [%d/%b/%Y:%H:%M:%S. Be careful with erroneous data though, as seen in row 3 of the sample data below ([30/Apr/1998:21:32:3l8,671). To not get an error you can pass errors='coerce', which will return Not a Time (NaT) for those rows. The other way would be to replace those rows manually or write some sort of regex/replace function first.

import pandas as pd

data = '''\
time, Numbers
[30/Apr/1998:21:30:17,24736
[30/Apr/1998:21:30:53,24736
[30/Apr/1998:21:32:3l8,671
[30/Apr/1998:21:32:38,1512
[30/Apr/1998:21:32:38,1136
[30/Apr/1998:21:32:58,-
[30/Apr/1998:21:32:59,231'''

fileobj = pd.compat.StringIO(data)

df = pd.read_csv(fileobj, sep=',', na_values=['-'])
df['time'] = pd.to_datetime(df['time'], format='[%d/%b/%Y:%H:%M:%S', errors='coerce')
print(df)

Returns:

                 time  Numbers
0 1998-04-30 21:30:17  24736.0
1 1998-04-30 21:30:53  24736.0
2                 NaT    671.0
3 1998-04-30 21:32:38   1512.0
4 1998-04-30 21:32:38   1136.0
5 1998-04-30 21:32:58      NaN
6 1998-04-30 21:32:59    231.0

Note that na_values=['-'] was used here to help pandas understand that the Numbers column is actually numbers and not strings.

And now we can perform actions like grouping (on minute, for instance):

print(df.groupby(df.time.dt.minute)['Numbers'].mean())

#time
#30.0    24736.000000
#32.0      959.666667
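One caveat (my note, not part of the original answer): pd.compat.StringIO is no longer available in recent pandas releases; the standard-library io.StringIO does the same job (adding skipinitialspace=True so the second column is named 'Numbers' rather than ' Numbers'):

from io import StringIO

fileobj = StringIO(data)
df = pd.read_csv(fileobj, sep=',', na_values=['-'], skipinitialspace=True)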
Slicing my data frame is returning unexpected results
I have 13 CSV files that contain billing information in an unusual format. Multiple readings are recorded every 30 minutes of the day. Five days are recorded beside each other (columns). Then the next five days are recorded under it. To make things more complicated, the day of the week, date, and billing day is shown over the first recording of KVAR each day. The image below shows a small example. However, imagine that KW, KVAR, and KVA repeat 3 more times before continuing some 50 rows later.

My goal is to create a simple Python script that would turn the data into a data frame with the columns: DATE, TIME, KW, KVAR, KVA, and DAY. The problem is that my script returns NaN data for KW, KVAR, and KVA after the first five days (which is correlated with a new iteration of the for loop). What is weird to me is that when I try to print out the same ranges I get the data that I expect. My code is below. I have included comments to help further explain things. I also have an example of sample output of my function.

def make_df(df):
    # starting values
    output = pd.DataFrame(columns=["DATE", "TIME", "KW", "KVAR", "KVA", "DAY"])
    time = df1.loc[3:50, 0]
    val_start = 3
    val_end = 51
    date_val = [0, 2]
    day_type = [1, 2]

    # There are 7 row movements that need to take place.
    for row_move in range(1, 8):
        day = [1, 2, 3]
        date_val[1] = 2
        day_type[1] = 2

        # There are 5 column movements that take place.
        # The basic idea is that I would cycle through the five days, grab their data
        # in a temporary dataframe, and then append that dataframe onto the output dataframe
        for col_move in range(1, 6):
            temp_df = pd.DataFrame(columns=["DATE", "TIME", "KW", "KVAR", "KVA", "DAY"])
            temp_df['TIME'] = time

            # These are the 3 values that stop working after the first column change
            # I get the values that I expect for the first 5 days
            temp_df['KW'] = df.iloc[val_start:val_end, day[0]]
            temp_df['KVAR'] = df.iloc[val_start:val_end, day[1]]
            temp_df['KVA'] = df.iloc[val_start:val_end, day[2]]

            # These 2 values work perfectly for the entire data set
            temp_df['DAY'] = df.iloc[day_type[0], day_type[1]]
            temp_df["DATE"] = df.iloc[date_val[0], date_val[1]]

            # trouble shooting
            print(df.iloc[val_start:val_end, day[0]])
            print(temp_df)

            output = output.append(temp_df)

            # increase values for each iteration of row loop.
            # seems to work perfectly when I print the data
            day = [x + 3 for x in day]
            date_val[1] = date_val[1] + 3
            day_type[1] = day_type[1] + 3

        # increase values for each iteration of column loop
        # seems to work perfectly when I print the data
        date_val[0] = date_val[0] + 55
        day_type[0] = day_type[0] + 55
        val_start = val_start + 55
        val_end = val_end + 55

    return output

test = make_df(df1)

Below is some sample output. It shows where the data starts to break down after the fifth day (or the first instance of the column shift in the for loop). What am I doing wrong?
Could be pd.append requiring matched row indices for numerical values.

import pandas as pd
import numpy as np

output = pd.DataFrame(np.random.rand(5, 2), columns=['a', 'b'])  # fake data
output['c'] = list('abcde')  # add a column of non-numerical entries

tmp = pd.DataFrame(columns=['a', 'b', 'c'])
tmp['a'] = output.iloc[0:2, 2]
tmp['b'] = output.iloc[3:5, 2]  # generates NaN
tmp['c'] = output.iloc[0:2, 2]
output.append(tmp)

(initial response) What does df1 look like? Does df.iloc[val_start:val_end, day[0]] have any issue past the fifth day? The code doesn't show how you read the csv files, or df1 itself. My guess: if val_start:val_end gives invalid indices on the sixth day, or df1 happens to be malformed past the fifth day, df.iloc[val_start:val_end, day[0]] will return an empty Series object and possibly make its way into temp_df. iloc does not report invalid row indices, though similar column indices would trigger IndexError.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(5, 3), columns=['a', 'b', 'c'], index=np.arange(5))  # fake data
df.iloc[0:2, 1]      # returns the subset
df.iloc[100:102, 1]  # returns: Series([], Name: b, dtype: float64)

A little off topic, but I would recommend preprocessing the csv files rather than dealing with indexing in a pandas DataFrame, as the original format is kind of complex. Slice the data by date and later use pd.melt or pd.groupby to shape it into the format you like. Or alternatively try a multi-index if you stick with pandas I/O.
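A possible refinement (my reading of the symptom, not stated in the original answer): the same index-alignment rule applies to plain column assignment, so once val_start jumps from 3 to 58 the slice's index (58-105) no longer overlaps temp_df's index (3-50) and the assigned column comes out all NaN. A minimal reproduction, with .values as one way around it:

import numpy as np
import pandas as pd

# A Series whose index does not overlap the target frame's index assigns as all NaN,
# much like rows 58-105 being assigned into a frame indexed 3-50.
src = pd.Series([1.0, 2.0, 3.0], index=[58, 59, 60])
dst = pd.DataFrame(index=[3, 4, 5])
dst['aligned'] = src            # all NaN: no overlapping index labels
dst['positional'] = src.values  # works: plain positional assignment
print(dst)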