Difficulty adding up elements in a pandas DataFrame - python

I'm currently having trouble adding up the rows for the following DataFrame which I have constructed for the returns for six companies' stocks:
import pandas as pd

def importdata(data):
    returns = pd.read_excel(data)  # Imports the data from Excel
    returns_with_dates = returns.set_index('Dates')  # Sets the Dates as the df index
    return returns_with_dates
which outputs:
Out[345]:
Company 1 Company 2 Company 3 Company 4 Company 5 Company 6
Dates
1997-01-02 31.087620 3.094705 24.058686 31.694404 37.162890 13.462241
1997-01-03 31.896592 3.109631 22.423629 32.064378 37.537013 13.511706
1997-01-06 31.723241 3.184358 18.803148 32.681000 37.038183 13.684925
1997-01-07 31.781024 3.199380 19.503886 33.544272 37.038183 13.660193
1997-01-08 31.607673 3.169431 19.387096 32.927650 37.537013 13.585995
1997-01-09 31.492106 3.199380 19.737465 33.420948 37.038183 13.759214
1997-01-10 32.589996 3.184358 19.270307 34.284219 37.661721 13.858235
1997-01-13 32.416645 3.199380 19.153517 35.147491 38.035844 13.660193
1997-01-14 32.301077 3.184358 19.503886 35.517465 39.407629 13.783946
1997-01-15 32.127726 3.199380 19.387096 35.887438 38.409967 13.759214
1997-01-16 32.532212 3.229232 19.737465 36.257412 39.282921 13.635460
1997-01-17 33.167833 3.259180 20.087835 37.490657 39.033505 13.858235
1997-01-20 33.456751 3.229232 20.438204 35.640789 39.657044 14.377892
1997-01-21 33.225616 3.244158 20.671783 36.010763 40.779413 14.179940
1997-01-22 33.110049 3.289033 21.489312 36.010763 40.654705 14.254138
1997-01-23 32.705563 3.199380 20.905363 35.394140 40.904121 14.229405
1997-01-24 32.127726 3.139579 20.204624 35.764114 40.405290 13.957165
1997-01-27 32.127726 3.094705 20.204624 35.270816 40.779413 13.882968
1997-01-28 31.781024 3.079778 20.788573 34.407544 41.153536 13.684925
1997-01-29 32.185510 3.094705 21.138942 34.654193 41.278244 13.858235
1997-01-30 32.647779 3.094705 21.022153 34.407544 41.652367 13.981898
1997-01-31 32.532212 3.064757 20.204624 34.037570 42.275905 13.858235
For countless hours I have tried to sum the rows in blocks: 1997-01-02 to 1997-01-08, then 1997-01-09 to 1997-01-15, and so on, i.e. the first five rows, then the following five rows. I also want to keep the date of the last row in each block as the index, so the sum of the rows from 1997-01-02 to 1997-01-08 should be indexed by 1997-01-08. Five rows is only an example; ideally I want to sum every n rows, then the following n rows, keeping the date in the same way. I have figured out a way of doing it in array form, shown in the code below, but then I lose the dates.
import numpy as np
import pandas as pd

def nday_sums(data, n):  # function header was lost in the post; this name is a placeholder
    returns = pd.read_excel(data)  # Imports the data from Excel
    returns_with_dates = returns.set_index('Dates')  # Sets the Dates as the df index
    returns_mat = returns_with_dates.to_numpy()  # as_matrix() was removed in pandas 1.0
    ndays = len(returns_mat) // n  # Number of complete n-day blocks in our time period
    asset_number = returns_mat.shape[1]  # One column per asset
    nday_returns = np.empty((ndays, asset_number))  # Empty array for the n-day log-returns
    for i in range(asset_number):
        for j in range(ndays):
            nday_returns[j, i] = np.sum(returns_mat[n * j:n * (j + 1), i])
    return nday_returns
Is there any way of doing this but in a DataFrame context whilst maintaining the dates in the way I said before? I've been trying to do this for sooo long without any kind of success and it's really stressing me out! For some reason everyone finds Pandas extremely useful and easy to use, but I happen to find it the opposite. Any kind of help would be very much appreciated. Thanks in advance.

groupby
df.groupby(np.arange(len(df)) // 5).sum()
To include the date index as requested
g = np.arange(len(df)) // 5
i = df.index.to_series().groupby(g).last()
df.groupby(g).sum().set_index(i)
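The question asks for an arbitrary block size n rather than a fixed 5. A minimal sketch of that generalisation follows; the function name `sum_every_n_rows` and the toy data are assumptions for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

def sum_every_n_rows(df, n):
    """Sum consecutive blocks of n rows, labelling each block with its last date."""
    g = np.arange(len(df)) // n                          # 0,0,...,1,1,... block ids
    last_dates = df.index.to_series().groupby(g).last()  # last date in each block
    return df.groupby(g).sum().set_index(last_dates)

# Toy data (hypothetical): ten business days starting 1997-01-02
dates = pd.bdate_range("1997-01-02", periods=10)
df = pd.DataFrame(
    {"Company 1": np.arange(10.0), "Company 2": np.ones(10)},
    index=dates,
)
df.index.name = "Dates"

result = sum_every_n_rows(df, 5)
print(result)
```

If the number of rows is not a multiple of n, the last block is simply shorter; whether that partial block should be kept is a choice the question leaves open.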

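If the gaps between your dates are regular (here, only weekends are missing), you can instead resample by the number of calendar days you want. Using resample keeps the dates in the index, and the loffset parameter shifts each label to the end of its window. Note that loffset was deprecated in pandas 1.1 and removed in 2.0; on newer versions, add the offset to the index of the result instead.
df.resample('7D', loffset='6D').sum()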
Company 1 Company 2 Company 3 Company 4 Company 5 Company 6
Dates
1997-01-08 158.096150 15.757505 104.176445 162.911704 186.313282 67.905060
1997-01-15 160.927550 15.966856 97.052271 174.257561 190.553344 68.820802
1997-01-22 165.492461 16.250835 102.424599 181.410384 199.407588 70.305665
1997-01-29 160.927549 15.608147 103.242126 175.490807 204.520604 69.612698
1997-02-05 65.179991 6.159462 41.226777 68.445114 83.928272 27.840133
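On pandas versions where loffset is no longer available, the same labels can be obtained by shifting the index after resampling. This is a sketch under the assumption that the index is a DatetimeIndex starting on the first day of the first window; the toy data is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): ten business days starting 1997-01-02
dates = pd.bdate_range("1997-01-02", periods=10)
df = pd.DataFrame({"Company 1": np.arange(10.0)}, index=dates)

weekly = df.resample("7D").sum()                  # windows labelled by their first day
weekly.index = weekly.index + pd.Timedelta(days=6)  # relabel with the window's last day
print(weekly)
```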

Related

filling and renaming a dataset at the same time

I would like to fill a dataset and compute the log returns at the same time.
These are the names of the return columns:
ret_names =['FTSEMIB_Index_ret', 'FCA_IM_Equity_ret', 'UCG_IM_Equity_ret', 'ISP_IM_Equity_ret',
'ENI_IM_Equity_ret',
'LUX_IM_Equity_ret']
and this is the Dataframe
'FTSEMIB_Index', 'FCA_IM_Equity', 'UCG_IM_Equity', 'ISP_IM_Equity','ENI_IM_Equity', 'LUX_IM_Equity'
0 22793.69 14.840 16.430 2.8860 14.040 49.24
1 22991.99 15.150 16.460 2.8780 14.220 48.98
2 23046.05 15.290 16.760 2.8660 14.300 48.70
3 23014.13 15.660 16.390 2.8500 14.380 48.72
4 23002.85 15.590 16.300 2.8420 14.500 49.13
So my idea is to use enumerate in a for loop:
for index, name in enumerate(ret_names):
    df[name] = np.diff(np.log(df.iloc[:, index]))
But the lengths do not match: np.diff returns one value fewer than its input, because the first row has no previous price to compare against.
Any idea?
I may have found a solution, but I can't figure out why the previous one doesn't work:
for index, name in enumerate(ret_names):
    df[name] = np.log(df.iloc[:, index] / df.iloc[:, index].shift(1))
This fills each column and assigns its name in one step; the first value of each return column is NaN.
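The length mismatch goes away when the difference is taken with pandas rather than NumPy, because `DataFrame.diff` keeps the original length and puts NaN in the first row. A minimal sketch, using two of the columns and three of the rows from the question; the `_ret` suffix follows the naming in `ret_names`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "FTSEMIB_Index": [22793.69, 22991.99, 23046.05],
    "FCA_IM_Equity": [14.840, 15.150, 15.290],
})

price_cols = list(df.columns)
# log(p_t) - log(p_{t-1}); the first row is NaN, so the length is unchanged
log_ret = np.log(df[price_cols]).diff().add_suffix("_ret")
df = df.join(log_ret)
print(df)
```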

Filling in missing value based on values in both preceding and succeeding rows

I have a dataset analogous to the one below where for a website I have the number of views every month for two years (2001-2002). However, due to the way the data was gathered, I only have information for a website if it had > 0 views. So, I am trying to fill in the number of views for months where that is not the case: i.e., cases where the website was online but had no views.
Unfortunately, I have no information for when the website was first published, so I assume that it was introduced the first time there are non-zero values for a month. I also assume the website was taken down if there are consecutive months with np.nan values at the end of 2002.
So, currently, the Views column has np.nan values both for months with zero views and for months when the website was simply not online.
I want to make sure that months with zero views have 0 in the Views column, such that the below data frame,
Website ,Month,Year ,Views
1,January,2001,
1,February,2001,
1,March,2001,3.0
1,April,2001,4.0
1,May,2001,23.0
1,June,2001,
1,July,2001,5.0
1,August,2001,4.0
1,September,2001,3.0
1,October,2001,3.0
1,November,2001,3.0
1,December,2001,35.0
1,January,2002,6.0
1,February,2002,
1,March,2002,3.0
1,April,2002,
1,May,2002,
1,June,2002,3.0
1,July,2002,3.0
1,August,2002,2.0
1,September,2002,
1,October,2002,
1,November,2002,
1,December,2002,
2,January,2001,3.0
2,February,2001,1.0
2,March,2001,2.0
2,April,2001,2.0
2,May,2001,22.0
2,June,2001,
2,July,2001,4.0
2,August,2001,3.0
2,September,2001,3.0
2,October,2001,4.0
2,November,2001,
2,December,2001,1.0
2,January,2002,
2,February,2002,4.0
2,March,2002,2.0
2,April,2002,5.0
2,May,2002,2.0
2,June,2002,
2,July,2002,2.0
2,August,2002,3.0
2,September,2002,
2,October,2002,
2,November,2002,2.0
2,December,2002,5.0
looks like this:
Website ,Month,Year ,Views
1,January,2001,
1,February,2001,
1,March,2001,3.0
1,April,2001,4.0
1,May,2001,23.0
1,June,2001,0.0
1,July,2001,5.0
1,August,2001,4.0
1,September,2001,3.0
1,October,2001,3.0
1,November,2001,3.0
1,December,2001,35.0
1,January,2002,6.0
1,February,2002,0.0
1,March,2002,3.0
1,April,2002,0.0
1,May,2002,0.0
1,June,2002,3.0
1,July,2002,3.0
1,August,2002,2.0
1,September,2002,
1,October,2002,
1,November,2002,
1,December,2002,
2,January,2001,3.0
2,February,2001,1.0
2,March,2001,2.0
2,April,2001,2.0
2,May,2001,22.0
2,June,2001,0.0
2,July,2001,4.0
2,August,2001,3.0
2,September,2001,3.0
2,October,2001,4.0
2,November,2001,0.0
2,December,2001,1.0
2,January,2002,0.0
2,February,2002,4.0
2,March,2002,2.0
2,April,2002,5.0
2,May,2002,2.0
2,June,2002,0.0
2,July,2002,2.0
2,August,2002,3.0
2,September,2002,0.0
2,October,2002,0.0
2,November,2002,2.0
2,December,2002,5.0
In other words, if all preceding months for that website show np.nan values, and the current value is np.nan, it should remain that way. Similarly, if all following months show np.nan, the column should remain np.nan as well. However, if at least one preceding month is not np.nan the value should change to 0, etc.
The tricky part is that my dataset has about 4,000,000 rows, and I need a fairly efficient way to do this.
Does anyone have any suggestions?
Here's my approach
# s counts the non-null views so far
s = df['Views'].notnull().groupby(df['Website']).cumsum()
# fill the null only where s > 0
df['Views'] = np.where(df['Views'].isna() & s.gt(0), 0, df['Views'])
# equivalent
# df.loc[df['Views'].isna() & s.gt(0), 'Views'] = 0
I followed Quang Hoang's response and used the below code, which worked perfectly:
#Same as Quang Hoang's answer:
s = df['Views'].notnull().groupby(df['Website']).cumsum()
#Count the non-null views so far but starting with the last observations
b = df['Views'].notnull()[::-1].groupby(df['Website']).cumsum()
# fill the null only where s > 0 and b > 0
df['Views'] = np.where(df['Views'].isna() & (s.gt(0) & b.gt(0)), 0, df['Views'])
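The forward and backward counts can be checked on a small frame. The sketch below uses made-up data and hypothetical variable names (`seen_before`, `seen_after`); it reverses both the values and the grouping key explicitly, then restores the original order before combining the masks.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): website 1 has leading and trailing gaps, website 2 only interior gaps
df = pd.DataFrame({
    "Website": [1] * 6 + [2] * 4,
    "Views": [np.nan, 3.0, np.nan, 5.0, np.nan, np.nan, 1.0, np.nan, np.nan, 2.0],
})

has_views = df["Views"].notna()
site = df["Website"]

# True once a non-null value has been seen earlier (or at this row) for the website
seen_before = has_views.groupby(site).cumsum().gt(0)
# Same count from the end of each website's history, restored to the original row order
seen_after = has_views.iloc[::-1].groupby(site.iloc[::-1]).cumsum().gt(0).iloc[::-1]

# Only gaps with data on both sides become 0
interior_gap = df["Views"].isna() & seen_before & seen_after
df.loc[interior_gap, "Views"] = 0
print(df)
```

Both passes are vectorised cumulative sums, so the cost grows linearly with the number of rows.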

Convert quandl fx hourly data to daily

I would like to convert hourly financial data imported into a pandas dataframe that has the following csv header to daily data:
symbol,date,hour,openbid,highbid,lowbid,closebid,openask,highask,lowask,closeask,totalticks
I've imported the data with pandas.read_csv(). I have eliminated all but one symbol from the data for testing purposes, and have figured out this part so far:
df.groupby('date').agg({'highask': [max], 'lowask': [min]})
I'm still pretty new to python, so I'm not really sure how to continue. I'm guessing I can use some kind of anonymous function to create additional fields. For example, I'd like to get the open ask price for each date at hour 0, and the close ask price for each date at hour 23. Ideally, I would add additional columns and create a new dataframe. I also want to add a new column for market price, that is, an average of the ask/bid for low, high, open, and close.
Any advice would be greatly appreciated. Thanks!
edit
As requested, here is the output I would expect for just 2018-07-24:
symbol,date,openbid,highbid,lowbid,closebid,openask,highask,lowask,closeask,totalticks
AUD/USD,2018-07-24,0.7422,0.74297,0.7429,0.74196,0.74257,0.743,0.74197,0.74258,5191
openbid is the openbid at the lowest hour column for a single date, closebid is the closebid at the highest hour for a single date, etc. Total ticks is the sum. What I am really struggling with is determining openbid, openask, closebid, and closeask.
Sample data:
symbol,date,hour,openbid,highbid,lowbid,closebid,openask,highask,lowask,closeask,totalticks
AUD/USD,2018-07-24,22,0.7422,0.74249,0.74196,0.7423,0.74225,0.74252,0.74197,0.74234,1470
AUD/USD,2018-07-24,23,0.7423,0.74297,0.7423,0.74257,0.74234,0.743,0.74234,0.74258,3721
AUD/USD,2018-07-25,0,0.74257,0.74334,0.74237,0.74288,0.74258,0.74335,0.74239,0.74291,7443
AUD/USD,2018-07-25,1,0.74288,0.74492,0.74105,0.74111,0.74291,0.74501,0.74107,0.74111,14691
AUD/USD,2018-07-25,2,0.74111,0.74127,0.74015,0.74073,0.74111,0.74129,0.74018,0.74076,6898
AUD/USD,2018-07-25,3,0.74073,0.74076,0.73921,0.73987,0.74076,0.74077,0.73923,0.73989,6207
AUD/USD,2018-07-25,4,0.73987,0.74002,0.73921,0.73953,0.73989,0.74003,0.73923,0.73956,3453
AUD/USD,2018-07-25,5,0.73953,0.74094,0.73946,0.74041,0.73956,0.74096,0.73947,0.74042,7187
AUD/USD,2018-07-25,6,0.74041,0.74071,0.73921,0.74056,0.74042,0.74069,0.73922,0.74059,10646
AUD/USD,2018-07-25,7,0.74056,0.74066,0.73973,0.74035,0.74059,0.74068,0.73974,0.74037,9285
AUD/USD,2018-07-25,8,0.74035,0.74206,0.73996,0.74198,0.74037,0.74207,0.73998,0.742,10234
AUD/USD,2018-07-25,9,0.74198,0.74274,0.74176,0.74225,0.742,0.74275,0.74179,0.74227,8224
AUD/USD,2018-07-25,10,0.74225,0.74237,0.74122,0.74142,0.74227,0.74237,0.74124,0.74143,7143
AUD/USD,2018-07-25,11,0.74142,0.74176,0.74093,0.74152,0.74143,0.74176,0.74095,0.74152,7307
AUD/USD,2018-07-25,12,0.74152,0.74229,0.74078,0.74219,0.74152,0.74229,0.74079,0.74222,10523
AUD/USD,2018-07-25,13,0.74219,0.74329,0.74138,0.74141,0.74222,0.74332,0.74136,0.74145,13983
AUD/USD,2018-07-25,14,0.74141,0.74217,0.74032,0.74065,0.74145,0.7422,0.74034,0.74067,21814
AUD/USD,2018-07-25,15,0.74065,0.74151,0.73989,0.74113,0.74067,0.74152,0.73988,0.74115,16085
AUD/USD,2018-07-25,16,0.74113,0.74144,0.74056,0.7411,0.74115,0.74146,0.74058,0.74111,7752
AUD/USD,2018-07-25,17,0.7411,0.7435,0.74092,0.74346,0.74111,0.74353,0.74094,0.74348,11348
AUD/USD,2018-07-25,18,0.74346,0.74445,0.74331,0.74373,0.74348,0.74446,0.74333,0.74373,9898
AUD/USD,2018-07-25,19,0.74373,0.74643,0.74355,0.74559,0.74373,0.74643,0.74358,0.7456,11756
AUD/USD,2018-07-25,20,0.74559,0.74596,0.74478,0.74549,0.7456,0.746,0.74481,0.74562,5607
AUD/USD,2018-07-25,21,0.74549,0.74562,0.74417,0.74438,0.74562,0.74576,0.74422,0.74442,3613
AUD/USD,2018-07-26,22,0.73762,0.73792,0.73762,0.73774,0.73772,0.73798,0.73768,0.73779,1394
AUD/USD,2018-07-26,23,0.73774,0.73813,0.73744,0.73807,0.73779,0.73816,0.73746,0.73808,3465
AUD/USD,2018-07-27,0,0.73807,0.73826,0.73733,0.73763,0.73808,0.73828,0.73735,0.73764,6582
AUD/USD,2018-07-27,1,0.73763,0.73854,0.73734,0.73789,0.73764,0.73857,0.73736,0.73788,7373
AUD/USD,2018-07-27,2,0.73789,0.73881,0.73776,0.73881,0.73788,0.73883,0.73778,0.73882,3414
AUD/USD,2018-07-27,3,0.73881,0.7393,0.73849,0.73875,0.73882,0.73932,0.73851,0.73877,4639
AUD/USD,2018-07-27,4,0.73875,0.739,0.73852,0.73858,0.73877,0.73901,0.73852,0.73859,2487
AUD/USD,2018-07-27,5,0.73858,0.73896,0.7381,0.73887,0.73859,0.73896,0.73812,0.73888,5332
AUD/USD,2018-07-27,6,0.73887,0.73902,0.73792,0.73879,0.73888,0.73902,0.73793,0.73881,7623
AUD/USD,2018-07-27,7,0.73879,0.7395,0.73844,0.73885,0.73881,0.7395,0.73846,0.73887,9577
AUD/USD,2018-07-27,8,0.73885,0.73897,0.73701,0.73727,0.73887,0.73899,0.73702,0.73729,12280
AUD/USD,2018-07-27,9,0.73727,0.73784,0.737,0.73721,0.73729,0.73786,0.73701,0.73723,8634
AUD/USD,2018-07-27,10,0.73721,0.73798,0.73717,0.73777,0.73723,0.73798,0.73718,0.73779,7510
AUD/USD,2018-07-27,11,0.73777,0.73789,0.73728,0.73746,0.73779,0.73789,0.7373,0.73745,4947
AUD/USD,2018-07-27,12,0.73746,0.73927,0.73728,0.73888,0.73745,0.73929,0.73729,0.73891,16853
AUD/USD,2018-07-27,13,0.73888,0.74083,0.73853,0.74066,0.73891,0.74083,0.73855,0.74075,14412
AUD/USD,2018-07-27,14,0.74066,0.74147,0.74025,0.74062,0.74075,0.74148,0.74026,0.74064,15187
AUD/USD,2018-07-27,15,0.74062,0.74112,0.74002,0.74084,0.74064,0.74114,0.74003,0.74086,10044
AUD/USD,2018-07-27,16,0.74084,0.74091,0.73999,0.74001,0.74086,0.74092,0.74,0.74003,6893
AUD/USD,2018-07-27,17,0.74001,0.74022,0.73951,0.74008,0.74003,0.74025,0.73952,0.74009,5865
AUD/USD,2018-07-27,18,0.74008,0.74061,0.74002,0.74046,0.74009,0.74062,0.74004,0.74047,4334
AUD/USD,2018-07-27,19,0.74046,0.74072,0.74039,0.74041,0.74047,0.74073,0.74041,0.74043,3654
AUD/USD,2018-07-27,20,0.74041,0.74066,0.74005,0.74011,0.74043,0.74068,0.74018,0.74023,1547
AUD/USD,2018-07-25,22,0.74438,0.74526,0.74436,0.74489,0.74442,0.7453,0.74439,0.74494,2220
AUD/USD,2018-07-25,23,0.74489,0.74612,0.74489,0.7459,0.74494,0.74612,0.74492,0.74592,4886
AUD/USD,2018-07-26,0,0.7459,0.74625,0.74536,0.74571,0.74592,0.74623,0.74536,0.74573,6602
AUD/USD,2018-07-26,1,0.74571,0.74633,0.74472,0.74479,0.74573,0.74634,0.74471,0.74481,10123
AUD/USD,2018-07-26,2,0.74479,0.74485,0.74375,0.74434,0.74481,0.74487,0.74378,0.74437,7844
AUD/USD,2018-07-26,3,0.74434,0.74459,0.74324,0.744,0.74437,0.74461,0.74328,0.744,6037
AUD/USD,2018-07-26,4,0.744,0.74428,0.74378,0.74411,0.744,0.7443,0.74379,0.74414,3757
AUD/USD,2018-07-26,5,0.74411,0.74412,0.74346,0.74349,0.74414,0.74414,0.74344,0.74349,5713
AUD/USD,2018-07-26,6,0.74349,0.74462,0.74291,0.74299,0.74349,0.74464,0.74293,0.743,12650
AUD/USD,2018-07-26,7,0.74299,0.74363,0.74267,0.74361,0.743,0.74363,0.74269,0.74362,8067
AUD/USD,2018-07-26,8,0.74361,0.74375,0.74279,0.74287,0.74362,0.74376,0.7428,0.74288,6988
AUD/USD,2018-07-26,9,0.74287,0.74322,0.74212,0.74318,0.74288,0.74323,0.74212,0.74319,7784
AUD/USD,2018-07-26,10,0.74318,0.74329,0.74249,0.74276,0.74319,0.74331,0.7425,0.74276,5271
AUD/USD,2018-07-26,11,0.74276,0.74301,0.74179,0.74201,0.74276,0.74303,0.7418,0.74199,7434
AUD/USD,2018-07-26,12,0.74201,0.74239,0.74061,0.74064,0.74199,0.74241,0.74063,0.74066,20513
AUD/USD,2018-07-26,13,0.74064,0.74124,0.73942,0.74008,0.74066,0.74124,0.73943,0.74005,19715
AUD/USD,2018-07-26,14,0.74008,0.74014,0.73762,0.73887,0.74005,0.74013,0.73764,0.73889,21137
AUD/USD,2018-07-26,15,0.73887,0.73936,0.73823,0.73831,0.73889,0.73936,0.73824,0.73833,11186
AUD/USD,2018-07-26,16,0.73831,0.73915,0.73816,0.73908,0.73833,0.73916,0.73817,0.73908,6016
AUD/USD,2018-07-26,17,0.73908,0.73914,0.73821,0.73884,0.73908,0.73917,0.73823,0.73887,6197
AUD/USD,2018-07-26,18,0.73884,0.73885,0.73737,0.73773,0.73887,0.73887,0.73737,0.73775,6127
AUD/USD,2018-07-26,19,0.73773,0.73794,0.73721,0.73748,0.73775,0.73797,0.73724,0.73751,3614
AUD/USD,2018-07-26,20,0.73748,0.73787,0.73746,0.73767,0.73751,0.7379,0.73748,0.73773,1801
AUD/USD,2018-07-26,21,0.73767,0.73807,0.73755,0.73762,0.73773,0.73836,0.73769,0.73772,1687
To assign a new column avg_market_price as the average:
df = df.assign(avg_market_price=df[['openbid', 'highbid', 'lowbid', 'closebid',
'openask', 'highask', 'lowask', 'closeask']].mean(axis=1))
You then want to set the index to a datetime index by combining the date and hour fields, then resample your data to daily time periods (1d). Finally, use agg to get the max, min and average values of specific columns.
import pandas as pd

# 'date' must already be datetime64, e.g. pd.read_csv(..., parse_dates=['date'])
>>> (df
     .set_index(df['date'] + pd.to_timedelta(df['hour'], unit='h'))
     .resample('1d')
     .agg({'highask': 'max', 'lowask': 'min', 'avg_market_price': 'mean'}))
highask lowask avg_market_price
2018-07-24 0.74300 0.74197 0.742402
2018-07-25 0.74643 0.73922 0.742142
2018-07-26 0.74634 0.73724 0.741239
2018-07-27 0.74148 0.73701 0.739011
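The part the question calls the real struggle, the open at the lowest hour and the close at the highest hour of each date, can be handled by sorting on hour and taking the first and last value per group. This is a sketch, not the answer's method; it uses four rows from the sample data (ask columns only) and named aggregation, which assumes pandas 0.25 or newer.

```python
import pandas as pd

# Four rows from the sample data, deliberately out of order
hourly = pd.DataFrame({
    "symbol": ["AUD/USD"] * 4,
    "date": ["2018-07-24", "2018-07-24", "2018-07-25", "2018-07-25"],
    "hour": [23, 22, 1, 0],
    "openask": [0.74234, 0.74225, 0.74291, 0.74258],
    "highask": [0.743, 0.74252, 0.74501, 0.74335],
    "lowask": [0.74234, 0.74197, 0.74107, 0.74239],
    "closeask": [0.74258, 0.74234, 0.74111, 0.74291],
    "totalticks": [3721, 1470, 14691, 7443],
})

daily = (
    hourly.sort_values(["symbol", "date", "hour"])   # so first/last follow the hour
    .groupby(["symbol", "date"], as_index=False)
    .agg(
        openask=("openask", "first"),    # open at the lowest hour of the date
        highask=("highask", "max"),
        lowask=("lowask", "min"),
        closeask=("closeask", "last"),   # close at the highest hour of the date
        totalticks=("totalticks", "sum"),
    )
)
print(daily)
```

The bid columns would follow the same pattern with their own first, max, min and last entries.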

transform raw date format into pandas date object

I have a CSV file which looks like this:
time, Numbers
[30/Apr/1998:21:30:17,24736
[30/Apr/1998:21:30:53,24736
[30/Apr/1998:21:31:12,24736
[30/Apr/1998:21:31:19,3781
[30/Apr/1998:21:31:22,-
[30/Apr/1998:21:31:27,24736
[30/Apr/1998:21:31:29,-
[30/Apr/1998:21:31:29,-
[30/Apr/1998:21:31:32,929
[30/Apr/1998:21:31:43,-
[30/Apr/1998:21:31:44,1139
[30/Apr/1998:21:31:52,24736
[30/Apr/1998:21:31:52,3029
[30/Apr/1998:21:32:06,24736
[30/Apr/1998:21:32:16,-
[30/Apr/1998:21:32:16,-
[30/Apr/1998:21:32:17,-
[30/Apr/1998:21:32:30,14521
[30/Apr/1998:21:32:33,11324
[30/Apr/1998:21:32:35,24736
[30/Apr/1998:21:32:3l8,671
[30/Apr/1998:21:32:38,1512
[30/Apr/1998:21:32:38,1136
[30/Apr/1998:21:32:38,1647
[30/Apr/1998:21:32:38,1271
[30/Apr/1998:21:32:52,5933
[30/Apr/1998:21:32:58,-
[30/Apr/1998:21:32:59,231
and so on, up to about one billion rows.
Ignore the Numbers column. My concern is converting this date-time format into pandas timestamps so that I can plot the dataset over time. I am new to data science; here is my approach:
step 1: read the whole time column from my CSV file into an array,
step 2: split each value at the first colon (:) to make two new arrays, date and time,
step 3: remove "[" from the date array,
step 4: replace the forward slashes with dashes in the date array,
step 5: join the date and time arrays into a single pandas-compatible format,
which would look like this: 2017-03-22 15:16:45. I know my approach is naive and probably wrong. If someone can help with a code snippet, I would be really happy. Thanks.
You can pass a format to pd.to_datetime(), in this case: [%d/%b/%Y:%H:%M:%S.
Be careful with erroneous data, though, as seen in row 3 of the sample data below ([30/Apr/1998:21:32:3l8,671). To avoid an error you can pass errors='coerce', which returns NaT (Not a Time) for values that cannot be parsed.
The other way would be to fix those rows manually or write some sort of regex/replace function first.
import io
import pandas as pd
data = '''\
time, Numbers
[30/Apr/1998:21:30:17,24736
[30/Apr/1998:21:30:53,24736
[30/Apr/1998:21:32:3l8,671
[30/Apr/1998:21:32:38,1512
[30/Apr/1998:21:32:38,1136
[30/Apr/1998:21:32:58,-
[30/Apr/1998:21:32:59,231'''
fileobj = io.StringIO(data)  # pd.compat.StringIO was removed in pandas 0.25
# skipinitialspace strips the blank in "time, Numbers" so the column is named 'Numbers'
df = pd.read_csv(fileobj, sep=',', na_values=['-'], skipinitialspace=True)
df['time'] = pd.to_datetime(df['time'], format='[%d/%b/%Y:%H:%M:%S', errors='coerce')
print(df)
Returns:
time Numbers
0 1998-04-30 21:30:17 24736.0
1 1998-04-30 21:30:53 24736.0
2 NaT 671.0
3 1998-04-30 21:32:38 1512.0
4 1998-04-30 21:32:38 1136.0
5 1998-04-30 21:32:58 NaN
6 1998-04-30 21:32:59 231.0
Note that: na_values=['-'] was used here to help pandas understand the Numbers column is actually numbers and not strings.
And now we can perform actions like grouping (on minute for instance):
print(df.groupby(df.time.dt.minute)['Numbers'].mean())
#time
#30.0 24736.000000
#32.0 959.666667
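The regex check mentioned in the answer could be sketched as follows: flag the rows whose time field does not match the expected shape, so they can be inspected instead of silently becoming NaT. The pattern and variable names are assumptions for illustration, and month-name parsing with %b assumes an English locale.

```python
import pandas as pd

raw = pd.Series([
    "[30/Apr/1998:21:30:17",
    "[30/Apr/1998:21:32:3l8",   # malformed seconds field from the sample data
    "[30/Apr/1998:21:32:38",
])

# "[", two digits, "/", three letters, "/", four digits, then ":HH:MM:SS" and nothing else
pattern = r"\[\d{2}/[A-Za-z]{3}/\d{4}:\d{2}:\d{2}:\d{2}$"
is_valid = raw.str.match(pattern)

bad_rows = raw[~is_valid]   # rows to inspect or repair by hand
parsed = pd.to_datetime(raw[is_valid], format="[%d/%b/%Y:%H:%M:%S")
print(bad_rows)
print(parsed)
```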

Slicing my data frame is returning unexpected results

I have 13 CSV files that contain billing information in an unusual format. Multiple readings are recorded every 30 minutes of the day. Five days are recorded beside each other (columns). Then the next five days are recorded under it. To make things more complicated, the day of the week, date, and billing day is shown over the first recording of KVAR each day.
The image below shows a small example. However, imagine that KW, KVAR, and KVA repeat 3 more times before continuing some 50 rows later.
My goal is to create a simple python script that turns the data into a data frame with the columns: DATE, TIME, KW, KVAR, KVA, and DAY.
The problem is my script returns NaN data for the KW, KVAR, and KVA data after the first five days (which is correlated with a new instance of a for loop). What is weird to me is that when I try to print out the same ranges I get the data that I expect.
My code is below. I have included comments to help further explain things. I also have an example of sample output of my function.
def make_df(df):
    #starting values
    output = pd.DataFrame(columns=["DATE", "TIME", "KW", "KVAR", "KVA", "DAY"])
    time = df1.loc[3:50,0]
    val_start = 3
    val_end = 51
    date_val = [0,2]
    day_type = [1,2]
    # There are 7 row movements that need to take place.
    for row_move in range(1,8):
        day = [1,2,3]
        date_val[1] = 2
        day_type[1] = 2
        # There are 5 column movements that take place.
        # The basic idea is that I would cycle through the five days, grab their data in a temporary dataframe,
        # and then append that dataframe onto the output dataframe
        for col_move in range(1,6):
            temp_df = pd.DataFrame(columns=["DATE", "TIME", "KW", "KVAR", "KVA", "DAY"])
            temp_df['TIME'] = time
            #These are the 3 values that stop working after the first column change
            # I get the values that I expect for the first 5 days
            temp_df['KW'] = df.iloc[val_start:val_end, day[0]]
            temp_df['KVAR'] = df.iloc[val_start:val_end, day[1]]
            temp_df['KVA'] = df.iloc[val_start:val_end, day[2]]
            # These 2 values work perfectly for the entire data set
            temp_df['DAY'] = df.iloc[day_type[0], day_type[1]]
            temp_df["DATE"] = df.iloc[date_val[0], date_val[1]]
            # trouble shooting
            print(df.iloc[val_start:val_end, day[0]])
            print(temp_df)
            output = output.append(temp_df)
            # increase values for each iteration of row loop.
            # seems to work perfectly when I print the data
            day = [x + 3 for x in day]
            date_val[1] = date_val[1] + 3
            day_type[1] = day_type[1] + 3
        # increase values for each iteration of column loop
        # seems to work perfectly when I print the data
        date_val[0] = date_val[0] + 55
        day_type[0] = day_type[0] + 55
        val_start = val_start + 55
        val_end = val_end + 55
    return output

test = make_df(df1)
Below is some sample output. It shows where the data starts to break down after the fifth day (or first instance of the column shift in the for loop). What am I doing wrong?
This could be caused by index alignment: assigning a Series to a DataFrame column matches on row labels, so labels that do not match produce NaN.
import pandas as pd
import numpy as np
output = pd.DataFrame(np.random.rand(10, 2), columns=['a', 'b'])  # fake data
output['c'] = list('abcdefghij')  # add a column of non-numerical entries
tmp = pd.DataFrame(columns=['a', 'b', 'c'])
tmp['a'] = output.iloc[0:2, 2]  # tmp takes the row labels 0 and 1
tmp['b'] = output.iloc[3:5, 2]  # generates NaN: labels 3 and 4 are not in tmp
tmp['c'] = output.iloc[0:2, 2]
print(tmp)
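The same effect, together with one possible way around it, can be sketched in a few lines. The frame `raw` and the column names are hypothetical stand-ins for the billing data; the fix shown (assigning the underlying array instead of the Series) is one option among several, `reset_index(drop=True)` on both sides being another.

```python
import numpy as np
import pandas as pd

# Stand-in for the raw sheet: 10 rows, 2 columns, row r holds [2r, 2r + 1]
raw = pd.DataFrame(np.arange(20.0).reshape(10, 2))

tmp = pd.DataFrame(columns=["TIME", "KW"])
tmp["TIME"] = raw.iloc[0:3, 0]   # tmp now carries the row labels 0, 1, 2

# A later block has labels 5, 6, 7, none of which exist in tmp, so every value is NaN
tmp["KW"] = raw.iloc[5:8, 1]
misaligned = bool(tmp["KW"].isna().all())

# Assigning the plain array bypasses label alignment and fills by position
tmp["KW"] = raw.iloc[5:8, 1].to_numpy()
print(misaligned)
print(tmp)
```

This matches the symptom in the question: the first block of rows lines up with the labels of the TIME column by coincidence, while every later block does not.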
(initial response)
What does df1 look like? Does df.iloc[val_start:val_end, day[0]] have any issue past the fifth day? The code doesn't show how you read the csv files, or df1 itself.
My guess: if val_start:val_end gives invalid positions on the sixth day, or df1 happens to be malformed past the fifth day, df.iloc[val_start:val_end, day[0]] will return an empty Series object that may make its way into temp_df. iloc does not raise on out-of-range row slices, though an out-of-range integer column position does raise IndexError.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(5,3), columns=['a','b','c'], index=np.arange(5)) # fake data
df.iloc[0:2, 1] # returns the subset
df.iloc[100:102, 1] # returns: Series([], Name: b, dtype: float64)
A little off topic, but I would recommend preprocessing the csv files rather than dealing with the indexing in a pandas DataFrame, as the original format is fairly complex. Slice the data by date and later use pd.melt or DataFrame.groupby to shape it into the format you like. Alternatively, try a MultiIndex if you stick with pandas I/O.
