Align years of daily data - python

Starting from a multi-annual record of temperature measured at different times of day, I would like to end up with a rectangular array of daily averages, each row representing one year of data.
The data looks like this
temperature.head()
date
1996-01-01 00:00:00 7.39
1996-01-01 03:00:00 6.60
1996-01-01 06:00:00 7.39
1996-01-01 09:00:00 9.50
1996-01-01 12:00:00 11.00
Name: temperature, dtype: float64
I computed daily averages with
import pandas as pd
daily = temperature.groupby(pd.TimeGrouper(freq='D')).mean()
Which yields
daily.head()
date
1996-01-01 9.89625
1996-01-02 10.73625
1996-01-03 6.98500
1996-01-04 5.62250
1996-01-05 8.84625
Freq: D, Name: temperature, dtype: float64
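Note: pd.TimeGrouper was removed in pandas 1.0; on current versions the same daily means come from resampling:
daily = temperature.resample('D').mean()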
Now for the final part I thought of something like
yearly_daily_mean = daily.groupby(pd.TimeGrouper(freq='12M', closed="left"))
but there are some issues here.
I need to drop the tail of the data that does not fill a complete year.
What happens if there is missing data?
How to deal with the leap years?
What is the next step? Namely, how to “stack” (in numpy's, not pandas' sense) the years of data?
I am using
array_temperature = np.column_stack([group[1] for group in yearly_daily_mean if len(group[1]) == 365])
but there should be a better way.
As a subsidiary question, how can I choose the starting day of the years of data?

If I understand you correctly, you want to reshape your timeseries of daily means (which you already calculated) to a rectangular dataframe with the different days as columns and the different years as rows.
This can be achieved easily with the pandas reshaping functions, e.g. with pivot:
Some dummy data:
In [44]: from datetime import date
In [45]: index = pd.date_range(start=date(1996, 1, 1), end=date(2010, 6, 30), freq='D')
In [46]: daily = pd.DataFrame(index=index, data=np.random.random(size=len(index)), columns=['temperature'])
First, I add columns with the year and day of the year:
In [47]: daily['year'] = daily.index.year
In [48]: daily['day'] = daily.index.dayofyear
In [49]: daily.head()
Out[49]:
temperature year day
1996-01-01 0.081774 1996 1
1996-01-02 0.694968 1996 2
1996-01-03 0.478050 1996 3
1996-01-04 0.123844 1996 4
1996-01-05 0.426150 1996 5
Now, we can reshape this dataframe:
In [50]: daily.pivot(index='year', columns='day', values='temperature')
Out[50]:
day 1 2 ... 365 366
year ...
1996 0.081774 0.694968 ... 0.679461 0.700833
1997 0.043134 0.981707 ... 0.009357 NaN
1998 0.257077 0.297290 ... 0.701941 NaN
... ... ... ... ... ...
2008 0.047145 0.750354 ... 0.996396 0.761159
2009 0.348667 0.827057 ... 0.881424 NaN
2010 0.269743 0.872655 ... NaN NaN
[15 rows x 366 columns]
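As for the subsidiary question of choosing the starting day of each year of data: one sketch (assuming the daily frame above) is to shift the index by a constant number of days before deriving year and day, so that each pivot row starts at the chosen anchor date, e.g. roughly July 1:
# Sketch: start each row on (approximately) July 1 by shifting the whole
# index back by a constant 181 days before extracting year/dayofyear.
# (181 days = Jan 1 .. Jun 30 in a non-leap year; leap years land one day
# off, the usual leap-year trade-off.)
shifted = daily.index - pd.Timedelta(days=181)
daily['year'] = shifted.year
daily['day'] = shifted.dayofyear
daily.pivot(index='year', columns='day', values='temperature')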

Here is how I would do it. Very simply: create a new df with the exact shape you want, then fill it with the daily means you want.
from datetime import datetime
import numpy as np
import pandas as pd
# This is my re-creation of the data you have. (I'm calling it df1.)
# It's essential that your date-time be in datetime.datetime format, not strings
byear = 1996 # arbitrary
eyear = 2005 # arbitrary
obs_n = 50000 # arbitrary
start_time = datetime.timestamp(datetime(byear,1,1,0,0,0,0))
end_time = datetime.timestamp(datetime(eyear,12,31,23,59,59,999999))
obs_times = np.linspace(start_time,end_time,num=obs_n)
index1 = pd.Index([datetime.fromtimestamp(i) for i in obs_times])
df1 = pd.DataFrame(data=np.random.rand(obs_n)*20,index=index1,columns=['temp'])
# ^some random data
# Here is the new empty dataframe (df2) where you will put your daily averages.
index2 = pd.Index(range(byear,eyear+1))
columns2 = range(1,367) # change to 366 if you want to assume 365-day years
df2 = pd.DataFrame(index=index2,columns=columns2)
# Some quick manipulations that allow the two dfs' indexes to talk to one another.
df1['year'] = df1.index.year # a new column with the observation's year as an integer
df1['day'] = df1.index.dayofyear # a new column with the day of the year as integer
df1 = df1.reset_index().set_index(['year','day'])
# Now get the averages for each day and assign them to df2.
for year in index2:
    for day in columns2[:365]:  # for all but the last entry in the range
        df2.loc[year, day] = df1.loc[(year, day), 'temp'].mean()
    if (year, 366) in df1.index:  # then if it's a leap year...
        df2.loc[year, 366] = df1.loc[(year, 366), 'temp'].mean()
If you don't want the final df to have any null values in that 366th column, just remove the final if-statement and set columns2 = range(1,366); then df2 will have all non-null values (assuming there was at least one measurement on every day of the observed time period).
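For what it's worth, the same fill can be done without the explicit loops: group df1 once by (year, day) and unstack the days into columns. A sketch assuming the df1 defined above:
# Loop-free equivalent (sketch): one groupby-mean, then days become columns.
df2_fast = df1.groupby(level=['year', 'day'])['temp'].mean().unstack('day')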

Assuming you already have daily averages (with a pd.DatetimeIndex) from your higher-frequency data as a result of:
daily = temperature.groupby(pd.TimeGrouper(freq='D')).mean()
IIUC, you want to transform the daily average into a DataFrame with an equal number of columns per row to capture annual data. You mention leap years as a potential issue when aiming for an equal number of columns.
I can imagine two ways of going about this:
1. Select a number of days per row - probably 365. Select rolling blocks of 365 consecutive daily data points for each row and align them by integer index within each block.
2. Select calendar years of data, filling in the gaps for leap years, and align by either MM-DD or day number within the year.
Starting with 20 1/2 years of daily random data as mock daily average temperatures:
from datetime import date
index = pd.date_range(start=date(1995, 1, 1), end=date(2015, 6, 30), freq='D')
df = pd.DataFrame(index=index, data=np.random.random(size=len(index)) * 30, columns=['temperature'])
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 7486 entries, 1995-01-01 to 2015-06-30
Freq: D
Data columns (total 1 columns):
temperature 7486 non-null float64
dtypes: float64(1)
memory usage: 117.0 KB
None
df.head()
temperature
1995-01-01 4.119212
1995-01-02 27.107131
1995-01-03 26.704931
1995-01-04 7.430203
1995-01-05 4.230398
df.tail()
temperature
2015-06-26 10.902779
2015-06-27 8.494378
2015-06-28 17.800131
2015-06-29 19.543815
2015-06-30 16.390435
Here's a solution to the first approach:
Select blocks of 365 consecutive days using .groupby(pd.TimeGrouper('365D')), and return each resulting groupby object of daily averages as a pd.DataFrame with an integer index that runs from 0 to 364 for each sequence:
aligned = df.groupby(pd.TimeGrouper(freq='365D')).apply(lambda x: pd.DataFrame(x.squeeze().tolist()).T)  # .squeeze() converts a single-column DataFrame to a pd.Series
The transpose inside the lambda aligns the 21 blocks of data: they line up by integer index in the columns, with the start date of each sequence in the index. This operation produces an extra index level, and the last row (the partial block) will have some missing data. Clean up both with:
aligned.dropna().reset_index(-1, drop=True)
to get a [20 x 365] DataFrame as follows:
DatetimeIndex: 20 entries, 1995-01-01 to 2013-12-27
Freq: 365D
Columns: 365 entries, 0 to 364
dtypes: float64(365)
memory usage: 57.2 KB
0 1 2 3 4 5 \
1995-01-01 29.456090 25.313968 4.146206 5.347690 25.767425 11.978152
1996-01-01 25.585481 26.846486 8.336905 16.749842 6.247542 17.723733
1996-12-31 23.410462 10.168599 5.601917 11.996500 8.650726 23.362815
1997-12-31 7.586873 23.882106 22.145595 3.287160 21.642547 1.949321
1998-12-31 14.691420 3.611475 28.287327 25.347787 13.291708 20.571616
1999-12-31 25.713866 17.588570 18.562117 19.420944 12.406293 11.870750
2000-12-30 5.099561 17.894763 21.168223 4.786461 24.521417 21.443607
2001-12-30 11.791223 8.352493 12.731769 0.459697 20.680396 27.554783
2002-12-30 3.785876 0.359850 20.828764 15.376991 14.086626 0.477615
2003-12-30 23.633243 12.726250 8.197824 16.355956 8.094145 1.410746
2004-12-29 1.139949 4.161267 9.043062 14.109888 13.538735 1.566002
2005-12-29 25.504224 19.346419 3.300641 26.933084 23.634321 18.323450
2006-12-29 10.535785 9.168498 27.222106 11.962343 10.004678 23.893257
2007-12-29 27.482856 6.910670 6.033291 12.673530 26.362971 4.492178
2008-12-28 11.152316 25.233664 22.124299 11.012285 1.992814 25.542204
2009-12-28 23.131021 16.363467 1.242393 10.387653 4.858851 26.553950
2010-12-28 13.134843 9.195658 19.075850 28.539387 3.075934 8.089347
2011-12-28 28.860275 10.121573 0.663906 19.687892 29.376377 11.488446
2012-12-27 7.644073 19.649330 25.497595 6.592940 8.879444 17.733670
2013-12-27 11.713996 2.602284 3.835302 22.244623 27.279810 14.144943
6 7 8 9 ... 355 \
1995-01-01 8.210005 8.129146 28.798472 25.646924 ... 24.177163
1996-01-01 0.481487 16.772357 3.934185 22.640157 ... 23.340931
1996-12-31 10.813812 16.276504 3.422665 14.916229 ... 13.817015
1997-12-31 19.184753 28.628326 22.134871 12.721064 ... 23.905483
1998-12-31 2.839492 7.889141 17.951959 25.233585 ... 28.002751
1999-12-31 6.958672 26.335427 23.361470 5.911806 ... 7.778412
2000-12-30 8.405042 25.229016 19.746462 15.332004 ... 5.703830
2001-12-30 0.558788 15.457327 20.987186 25.452723 ... 29.771372
2002-12-30 19.002685 26.455754 25.468178 25.383786 ... 14.238987
2003-12-30 22.984328 15.934398 25.361599 12.221306 ... 1.189949
2004-12-29 22.121901 21.421103 26.175702 16.040881 ... 19.945408
2005-12-29 2.557901 15.193412 27.049389 4.825570 ... 7.629859
2006-12-29 8.582602 26.037375 0.933591 13.469771 ... 29.453932
2007-12-29 29.437921 26.470153 9.917871 16.875801 ... 5.702116
2008-12-28 3.809633 10.583385 18.029571 0.440077 ... 11.337894
2009-12-28 24.406696 28.294553 19.929563 4.683991 ... 25.697446
2010-12-28 29.765551 16.716723 6.467946 10.998447 ... 26.988863
2011-12-28 28.962746 11.407137 9.957111 4.502521 ... 14.606937
2012-12-27 1.374502 5.571244 11.212960 9.949830 ... 23.345868
2013-12-27 26.373866 4.781510 16.828510 10.280078 ... 0.552726
356 357 358 359 360 361 \
1995-01-01 13.511951 10.126835 28.121730 23.275360 11.785242 27.907039
1996-01-01 13.362737 14.336780 24.114908 28.479688 8.509069 17.408937
1996-12-31 19.192674 1.146844 27.499688 7.090407 2.777819 22.826814
1997-12-31 21.502186 10.495148 21.786895 12.229181 8.068271 6.522108
1998-12-31 21.338355 11.978265 9.186161 21.053924 3.033370 29.934703
1999-12-31 5.960120 20.325684 0.915052 15.059979 12.194240 20.138567
2000-12-30 11.883186 2.764768 27.324304 29.630706 21.852058 20.416199
2001-12-30 7.802891 25.384479 9.044486 8.809446 7.606603 6.051890
2002-12-30 7.362494 8.940783 5.259984 7.035818 24.094134 7.197113
2003-12-30 25.596902 9.756372 6.345198 1.520188 22.752717 3.470268
2004-12-29 26.789064 9.708466 18.287838 21.134643 29.862135 19.926086
2005-12-29 26.398394 24.717514 16.606042 28.189245 24.574806 14.297410
2006-12-29 8.795342 18.019536 16.579878 20.368811 22.052442 26.393676
2007-12-29 8.696240 25.901889 16.410934 15.274897 14.365867 10.523388
2008-12-28 18.581513 25.974784 21.025297 10.521118 5.864974 2.373023
2009-12-28 14.437944 21.717456 4.017870 14.024522 0.959989 17.215403
2010-12-28 11.426540 13.751451 4.664761 15.373878 7.731613 7.269089
2011-12-28 1.952897 9.406866 28.957258 20.239517 11.156958 29.238761
2012-12-27 7.588643 21.186675 17.348911 1.354323 13.918083 3.034123
2013-12-27 22.916065 2.089675 22.832061 14.787841 25.697875 14.087893
362 363 364
1995-01-01 13.107523 10.740551 20.511825
1996-01-01 25.016219 17.885332 2.438875
1996-12-31 24.692327 0.221760 6.749919
1997-12-31 24.856169 0.930019 22.603652
1998-12-31 18.361414 13.587695 25.161495
1999-12-31 0.512120 26.482288 1.035197
2000-12-30 15.401012 28.334219 5.965014
2001-12-30 10.292213 10.951915 8.270319
2002-12-30 21.945734 27.076438 6.795688
2003-12-30 14.788929 19.456459 11.216835
2004-12-29 7.086443 25.463503 17.549196
2005-12-29 12.252487 29.081547 25.507369
2006-12-29 0.012617 0.086186 17.421958
2007-12-29 4.191633 21.588891 7.516187
2008-12-28 26.194288 20.500256 24.876032
2009-12-28 28.445254 27.338754 7.849899
2010-12-28 28.888573 26.801262 23.117027
2011-12-28 19.871547 20.324514 18.369134
2012-12-27 15.907752 9.417700 4.922940
2013-12-27 21.132385 20.707216 5.288128
[20 rows x 365 columns]
If you want to simply gather calendar years of data and align them row by row, you can do the following (in this output, non-leap years end up with a missing value in the last column; aligning on day of year instead would place the gap around day 60, at the Feb 29 position):
df.groupby(pd.TimeGrouper(freq='A')).apply(lambda x: pd.DataFrame(x.squeeze().tolist()).T).reset_index(-1, drop=True)
0 1 2 3 4 5 \
1995-12-31 1.245796 28.487530 0.574299 10.033485 19.221512 8.718728
1996-12-31 12.258653 3.864652 25.237088 13.982809 24.494746 13.822292
1997-12-31 22.239412 4.796824 21.389404 11.151171 25.577368 1.754948
1998-12-31 24.968287 2.089894 25.888487 28.291714 19.115844 24.426285
1999-12-31 9.285363 19.339405 26.012193 3.243394 25.176499 8.766770
2000-12-31 26.996573 26.404391 1.793644 21.314488 13.118279 26.703532
2001-12-31 16.303829 14.021771 20.828238 11.427195 3.099290 18.730795
2002-12-31 14.614617 10.694258 5.226033 24.900849 17.395822 22.154202
2003-12-31 10.564132 8.267639 7.778573 26.704936 5.671499 0.470963
2004-12-31 22.649623 15.725867 18.445629 7.529507 11.868134 10.965534
2005-12-31 2.406615 9.709624 23.284616 11.479254 23.814725 1.656826
2006-12-31 19.164459 23.177769 16.091672 28.936777 28.636072 4.838555
2007-12-31 12.371377 3.417582 21.067689 25.493921 25.410295 15.526614
2008-12-31 29.080385 4.653984 16.567333 24.248921 27.338538 9.353291
2009-12-31 29.608734 6.046593 22.738628 22.631714 26.061903 21.217846
2010-12-31 27.458254 15.146497 18.917073 8.473955 26.782767 10.891648
2011-12-31 25.433759 8.959650 14.343507 16.249726 17.031174 12.944418
2012-12-31 22.940797 4.791280 11.765939 25.925645 3.649440 27.483407
2013-12-31 11.684391 27.701678 27.423083 27.656086 9.374896 14.250936
2014-12-31 23.660098 27.768960 25.753294 3.014606 23.330226 17.570492
6 7 8 9 ... 356 \
1995-12-31 17.079137 26.100763 12.376462 12.315219 ... 16.910185
1996-12-31 26.718277 10.349412 12.940624 9.453769 ... 19.235435
1997-12-31 20.201528 22.895552 1.443243 20.584140 ... 29.665815
1998-12-31 21.493163 16.724328 5.946833 15.230762 ... 2.617883
1999-12-31 9.776013 13.381424 11.028295 1.905501 ... 7.200409
2000-12-31 9.773097 14.565345 22.578398 0.688273 ... 18.119020
2001-12-31 1.095308 14.817514 25.652418 8.327481 ... 15.385689
2002-12-31 29.744794 15.545211 6.373948 13.451261 ... 7.446414
2003-12-31 14.971959 25.948332 21.596976 5.355589 ... 23.676867
2004-12-31 0.604113 2.858745 0.120340 19.365223 ... 0.336213
2005-12-31 6.260722 9.819337 19.573953 11.132919 ... 26.107100
2006-12-31 10.341241 15.126506 3.349634 23.619127 ... 15.508680
2007-12-31 20.033540 22.103483 7.674852 1.263726 ... 15.148461
2008-12-31 28.233973 27.982105 17.037928 5.389418 ... 8.773618
2009-12-31 4.400039 7.284556 11.825382 4.201001 ... 6.734423
2010-12-31 26.086305 26.275027 8.069376 19.200344 ... 19.056528
2011-12-31 29.215028 0.985623 4.813478 7.752540 ... 14.395423
2012-12-31 4.690336 9.618306 25.492041 10.400292 ... 8.853903
2013-12-31 8.227096 11.013431 0.996911 15.276574 ... 26.227540
2014-12-31 23.440591 16.544698 2.263684 3.919315 ... 24.987387
357 358 359 360 361 362 \
1995-12-31 24.791125 21.443534 21.092439 8.289222 9.745293 20.084046
1996-12-31 2.632656 2.102163 24.828437 18.104255 7.951859 3.266873
1997-12-31 11.246534 14.086539 29.635519 19.518642 24.086108 6.041870
1998-12-31 29.961162 9.924863 9.401790 25.597344 13.885467 16.537406
1999-12-31 3.057125 15.241720 8.472388 3.248545 11.302522 19.283612
2000-12-31 22.999729 17.518504 10.058249 2.953903 10.167712 17.309525
2001-12-31 18.267445 23.205300 25.658591 19.915797 10.704525 26.604965
2002-12-31 11.497110 3.641206 9.693428 24.571510 6.438652 29.280098
2003-12-31 23.931401 19.967615 0.307896 0.385782 0.579257 7.534806
2004-12-31 21.321146 9.224362 1.703842 6.180944 28.173925 5.178336
2005-12-31 17.990409 28.746179 2.524899 10.555224 25.487723 19.877390
2006-12-31 9.748760 29.069966 1.717175 3.283069 9.615215 25.787787
2007-12-31 29.772930 20.892030 16.597493 20.079373 17.320327 9.583089
2008-12-31 22.787891 26.636413 13.872783 29.305847 21.287553 1.263788
2009-12-31 1.574188 23.172773 0.967153 1.928999 12.201354 0.125939
2010-12-31 20.566125 0.429552 4.413156 16.106451 27.745684 18.280928
2011-12-31 9.348584 2.604338 23.397221 7.378340 16.757224 29.364973
2012-12-31 4.704570 7.278321 19.034622 24.597784 13.694635 15.912901
2013-12-31 21.657446 14.110146 23.976991 8.203509 20.083490 4.471119
2014-12-31 14.465823 9.105391 15.984162 6.796756 8.232619 18.761280
363 364 365
1995-12-31 28.165022 9.735041 NaN
1996-12-31 11.644543 4.139818 5.420238
1997-12-31 2.500165 18.290531 NaN
1998-12-31 23.856333 10.064951 NaN
1999-12-31 3.090008 26.203395 NaN
2000-12-31 22.216599 27.942821 0.791318
2001-12-31 25.682003 4.766435 NaN
2002-12-31 19.785159 28.972659 NaN
2003-12-31 15.692168 21.388069 NaN
2004-12-31 9.079675 7.392328 12.583179
2005-12-31 18.202333 21.895494 NaN
2006-12-31 20.951937 26.220226 NaN
2007-12-31 23.603166 28.165377 NaN
2008-12-31 20.532933 9.401494 25.296916
2009-12-31 5.879644 10.377044 NaN
2010-12-31 0.436284 20.875852 NaN
2011-12-31 13.205290 6.832805 NaN
2012-12-31 23.253155 17.760731 23.270751
2013-12-31 19.807798 2.453238 NaN
2014-12-31 12.817601 11.756561 NaN
[20 rows x 366 columns]
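A caveat for readers on current pandas: pd.TimeGrouper was removed in pandas 1.0, so the groupby calls above need pd.Grouper instead. A sketch of the first approach in modern form:
# Same 365-day blocking with pd.Grouper in place of the removed TimeGrouper
aligned = (df.groupby(pd.Grouper(freq='365D'))
             .apply(lambda x: pd.DataFrame(x.squeeze().tolist()).T)
             .dropna()
             .reset_index(-1, drop=True))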

Related

looping over variables in pandas and for each month, computing OHLC and storing data into new dataframe's columns

I have weekly financial close-price data which looks like this:
Date V1 V2 V3 V4 V5 V6 V7
2010-01-01 77.31 66.94 52.33 34.94 81.38 84.75 482
2010-01-08 78.05 68.85 52.84 34.66 90.15 95.61 508
2010-01-15 79.29 68.3 53.61 35.33 86.97 97.87 490
2010-01-22 80.57 68.19 55.43 35.8 86.04 99.26 480
2010-01-29 81.87 68.79 55.84 35.6 83.36 98.53 462
2010-02-05 83.7 70.35 57.3 36.57 84.54 91.83 464
2010-02-12 81.85 68.32 56.4 37.35 81.2 90.75 455
2010-02-19 82.66 69.04 56.21 36.89 81.85 93.98 457
2010-02-26 86.32 69.7 57.43 37.12 83.96 96.43 467
2010-03-05 85.37 69.98 57.34 36.71 84.01 94.83 466
2010-03-12 84.04 69.76 56.74 36.98 83.02 93.92 466
2010-03-19 84.37 69.76 56.77 37.07 83.29 95.04 458
2010-03-26 85.7 70.06 56.62 36.81 81.64 94.84 459
2010-04-02 85.38 70.72 56.03 36.78 83.91 94.98 464
2010-04-09 89.21 71.7 58.38 37.49 86.95 98.74 471
2010-04-16 89.74 72.35 58.74 38.05 85.58 98.28 487
2010-04-23 90.72 74.26 60.61 38.64 90.5 100.18 492
2010-04-30 99.79 78.67 65.14 38.89 95.82 108.87 494
2010-05-07 102.34 81.48 63.45 41.87 93.18 106.2 478
2010-05-14 96.42 79.81 62.57 41.23 88.94 102.23 484
2010-05-21 96.17 76.9 61.06 39.28 88.22 97.8 444
2010-05-28 95.73 77.67 61.1 39.88 92.88 96.84 421
Here V1, V2... V7 are some companies for which weekly closing prices are given in the above table data.
What I want to do is to calculate:
for each month, the open, close, high and low prices;
open should be the price on the first date in the Date column for that month, and close the price on the last date.
I am using the code below, which returns the result shown after it:
def calculate(x):
    open = x.loc[x.index.min(), "V1"]   # price on the first row of the month
    high = x["V1"].max()
    low = x["V1"].min()
    close = x.loc[x.index.max(), "V1"]  # price on the last row of the month
    return open, high, low, close
result = pd.DataFrame()
result = df.groupby(df["Date"].dt.to_period("M")).apply(calculate)
result
Result of the above:
Date
2010-01 (77.31, 81.87, 77.31, 81.87)
2010-02 (83.7, 86.32, 81.85, 86.32)
2010-03 (85.37, 85.7, 84.04, 85.7)
2010-04 (85.38, 99.79, 85.38, 99.79)
2010-05 (102.34, 102.34, 95.73, 95.73)
...
Now I want to take these tuples into respective columns, along with the Date:
Date, Open, High, Low, Close
And also:
2. I want to repeat the above function for all the variables (V1 through V7) using a single loop operation or something similar.
Could someone please suggest how I can do that?
IIUC, you can also try:
df.Date = pd.to_datetime(df.Date)
df = df.sort_values('Date')
df = (df.groupby(pd.Grouper(key='Date', freq='1M'))
        .agg(**{'open': ('V1', 'first'), 'high': ('V1', 'max'),
                'low': ('V1', 'min'), 'close': ('V1', 'last')}))
OUTPUT:
open high low close
Date
2010-01-31 77.31 81.87 77.31 81.87
2010-02-28 83.70 86.32 81.85 86.32
2010-03-31 85.37 85.70 84.04 85.70
2010-04-30 85.38 99.79 85.38 99.79
2010-05-31 102.34 102.34 95.73 95.73
NOTE: you can also use resample:
df = df.set_index('Date').resample('1M').agg({'V1': ['min', 'max', 'first', 'last']})
Updated Answer:
df1 = df.set_index('Date').resample('1M').agg(['min', 'max', 'first', 'last'])
mapping = {'min': 'low', 'max': 'high', 'first': 'open', 'last': 'close'}
df1.columns = [f'{i}_{mapping[j]}' for i, j in df1.columns]
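After the rename, each company's OHLC sits in four flat columns, so a single company can be pulled out by name. A usage sketch (column names as produced by the mapping above):
v1_monthly = df1[['V1_open', 'V1_high', 'V1_low', 'V1_close']]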

How to access last year's values to compare year on year? (datetime index)

I have a dataframe with 2 years' worth of dates and the revenue on each day:
import pandas as pd
import datetime
import itertools
import time
import plotly.graph_objects as go
startDate = datetime.date(2018,1,1)
endDate = datetime.date(2019,12,31)
date_range = pd.date_range(start=startDate, end=endDate)
performance_df = pd.DataFrame(date_range)
performance_df.columns = ['Date']
performance_df = performance_df.set_index(['Date'])
performance_df['Revenue'] = [25891.84678700861, 25851.615540667623, 25037.71189951304, 26715.764965288607, 23988.35694961679, 19029.057983049668, 16935.481705163278, 22756.072913397493, 30385.672828716073, 32970.13217533639, 31089.167074855934, 24262.972414940097, 18261.273831731618, 18304.754083985797, 26297.835664941533, 32619.66940484445, 35565.26222544722, 33229.97193979324, 25405.647135516112, 19980.890374561146, 20487.553160719217, 29709.0323217462, 38164.493647661984, 39050.80114673623, 36612.554432511824, 28169.78252364524, 22086.641617812107, 21631.662705640312, 28419.94529036299, 35644.61736420142, 35829.065860994495, 32907.079142030154, 25951.247521574016, 22888.00983435945, 22582.648252027546, 30024.92542296243, 37891.251492167445, 39065.307017542225, 35326.30407697067, 28447.88908662872, 25042.500493029664, 26403.83421252776, 32380.475740928025, 36605.55089473326, 36006.56039455697, 35189.153100968004, 29780.77465095395, 24909.218739056418, 23685.75537938559, 31839.08994457272, 39061.7208522828, 39973.50309446715, 36623.93766798115, 28038.08152342491, 23004.30712890111, 22349.30571082852, 29259.27790736973, 36562.99112657728, 34942.83314919648, 30908.429691071553, 25025.293504644822, 22417.499234687977, 23235.05923247665, 30142.36658055089, 37322.22656885001, 39533.654081050176, 38817.55694852113, 32902.066818425184, 27289.215659267025, 26836.10240333383, 32714.6554385672, 40479.10841583944, 43032.710867507936, 42172.617851188385, 34581.509020848, 28796.31836571319, 28742.324093757332, 34514.108362569794, 41762.5838531726, 43526.12116978522, 39641.50434996709, 30279.354030113776, 23901.27040606382, 24482.3224100694, 34144.561683174434, 40900.767127300045, 41325.58696351466, 32836.06314047833, 28141.78555667737, 30152.48501882366, 31302.601421675394, 30866.3243266088, 35869.875330241855, 40580.31733582241, 41993.5864357607, 39319.78415151001, 35718.297902676924, 36181.313810975684, 41230.810723606075, 46047.43448330563, 47857.354050289476, 45322.56791751129, 37822.96403899934, 31215.19286619295, 29918.90585181318, 35176.29194324105, 43574.87496458186, 46034.04274563455, 43441.151399711365, 34723.430758204755, 27862.663129803153, 28567.807404056617, 37133.90907964683, 43009.01499641166, 45626.712492547886, 43142.632484355025, 33862.51562817326, 27397.539418702294, 27826.66395150021, 33130.40536093521, 39810.27565167983, 43719.59292308625, 41154.292760403885, 35925.79865016479, 32567.778590584392, 33297.526224142544, 32947.89653815421, 42798.953783336874, 47614.60147123756, 45254.62604676405, 38343.504556057334, 33179.627837955115, 33582.52982828824, 42105.00412410267, 48043.66651093638, 48837.40726130669, 48430.69364401822, 41076.68912572308, 34892.890461267096, 34935.159059467, 43160.00734032636, 52344.94145539043, 53619.43675580939, 52103.10220317212, 46364.68259720105, 43350.64074445112, 43840.939241180975, 41388.2098953964, 50023.86997334399, 53997.492172694445, 51246.8781429738, 44896.933023168574, 39424.66568855407, 39201.81892657556, 48299.32823370456, 57482.470637739105, 59523.110347325885, 55947.08067737157, 46045.719863431106, 38883.154337118904, 39475.701405236185, 49427.033216828415, 60480.43065506853, 61135.583867704656, 56363.832578283495, 46645.212577169856, 38947.471275125194, 39329.754583734706, 50567.536867536335, 60653.82734696712, 63896.50017170786, 59898.432410040055, 48184.54638977423, 41190.50886536953, 41376.579389025275, 51676.40473294583, 60212.103940879424, 61073.542419917525, 57551.1469742174, 46747.318316404155, 40323.9814761604, 41029.31091363546, 
49818.65317184012, 58397.91950877408, 61499.7188209775, 58381.773792521184, 51645.68301028936, 46686.85877971279, 46284.20382322595, 53168.20709487424, 59629.08096569072, 61559.59693441008, 58214.20592954166, 50189.35908931843, 44830.32689867108, 44553.609729770156, 50596.800094530976, 57784.320856900726, 58856.24561158704, 55747.50121815237, 45909.22320056169, 39700.62514340684, 40026.08060928283, 45670.31405277475, 53989.53076463664, 56840.67697150013, 51990.04207895543, 44479.51204240872, 40155.77712289925, 39704.41534828166, 44546.13749709498, 47853.054952435254, 47513.191562263164, 44746.64958742695, 38570.66136465314, 34913.4920574524, 34960.54662729273, 38809.56679736621, 44328.06662512622, 46786.97399754649, 43176.24626069548, 38761.63401887685, 35505.42439182791, 34491.44625748903, 38371.78994694245, 43751.84248007749, 45179.535352503226, 42830.262169078655, 37807.84955152587, 33490.071062830524, 33451.981211263366, 38489.460640344005, 44884.80556430299, 48173.59627480145, 45230.34903136869, 40408.815586107376, 37482.963570560685, 37298.73472663822, 35613.14542796679, 44369.054329647, 48985.813091293036, 47004.185926539445, 40072.386470837155, 36025.40024944878, 37087.92873340035, 43829.83096193774, 53122.12039373634, 56019.219188405055, 56327.64714973823, 49299.052443800436, 43566.64775098363, 44683.28514853064, 53151.23631794093, 62692.280707330996, 64588.59942255457, 61225.42426070072, 52811.43214958245, 45888.1735729261, 46327.71392726898, 55065.63459685981, 63657.55109568991, 66424.12156809494, 64999.66733479166, 57939.99891629782, 50574.620616169435, 50558.85074509659, 58418.01819988318, 64054.52320755815, 65362.71077771696, 63212.24817914635, 53816.39717322037, 47299.69358112465, 46438.29122104288, 51606.847773139874, 58286.00492998514, 60724.299744674456, 60074.178339144695, 53563.16147623882, 47537.06339596468, 48392.602700494775, 56176.46157312282, 64575.322111131725, 66575.84159174575, 64667.37570830546, 58256.091303140376, 53478.61964952481, 55099.86601961843, 62950.88160139487, 67181.84218779847, 66728.1189827789, 64883.456569064045, 56046.87389471389, 49400.485446729814, 47508.2631477567, 50017.97093003997, 54896.861775391895, 54433.64513960683, 50522.90141287548, 40609.51437863042, 33685.99041915158, 31782.758247954887, 36404.455159974576, 40003.119259204075, 39722.56468522227, 38114.81374102996, 32394.83359664386, 27315.366586900353, 27310.107293953042, 33405.16835200959, 42120.632096858724, 43190.48348102931, 41290.86540942159, 35119.35131893462, 28756.590790603772, 28610.11081953303, 35550.66376207889, 41120.32617529186, 42589.10273496922, 42320.707348300246, 35497.925445967456, 28680.144395217914, 28433.68805319704, 34480.02122168917, 40808.10672190518, 41632.86607595266, 39212.58899530489, 32942.6873470945, 27158.723820603347, 27161.92132942049, 33334.34617535648, 38679.06248665687, 40681.03562440046, 39477.59519930245, 34513.37459740981, 28345.14667273714, 28289.697206577042, 34737.3173677138, 40574.91034815245, 40556.06688657629, 39927.322507441866, 34634.7483828078, 27666.467364275486, 27774.36383185118, 33950.84687537262, 39518.06131165054, 39587.56870083202, 38832.66031065059, 32258.462222065184, 23343.00831465727, 23914.89577468648, 28173.094174897382, 30306.555827203953, 28284.310391780135, 24228.75442600916, 20495.999364892246, 19302.93644485608, 21391.090776536974, 21072.220129100904, 19770.681250398102, 22751.205447107975, 25744.075479601106, 27119.697588885116, 28894.626077292316, 30321.364424584666, 29665.55870322018, 28601.71879337108, 
28071.180317842627, 27522.026515632668, 25081.934367325113, 21303.503766392783, 18866.89154435026, 23938.585815421528, 29814.69141061624, 33132.41332574798, 32403.437424673357, 25058.826704400228, 19710.392712492816, 20359.305357642494, 28801.270965356667, 36009.139554077155, 38616.97439807099, 36667.40186755878, 28534.365261187533, 22467.35716166321, 22792.672711448053, 29707.297931805748, 37470.29566599212, 39159.19770376187, 35885.39779973568, 28198.615393579756, 22794.986958484948, 22618.6644213648, 31692.436151849328, 38546.03983274927, 38392.05968074403, 36221.937805245456, 28193.106505443564, 24431.214224965315, 24273.041173434154, 33527.551077655895, 39799.45238198621, 42167.39446971978, 38724.44521961169, 31303.32596831539, 28852.214711208617, 29328.603401220582, 38595.29805477188, 46086.742816943195, 46212.71514312982, 40873.403063823505, 34050.57262699937, 30182.502158521005, 29877.84162199769, 36061.93273317978, 42775.17967463026, 46034.16044887075, 41613.766613228676, 33678.29697864473, 28330.525176789113, 27672.35860253945, 35245.249394927574, 44232.03285856061, 45226.19144817676, 43043.919289296115, 33089.63437761405, 27591.940183796116, 27719.98729388228, 34909.70477643947, 43035.046467701366, 45301.769969577756, 42110.40663131329, 33482.256424365776, 29072.549855117086, 28488.907914432693, 37030.57030038991, 44216.431591844026, 46331.06515629943, 45488.943714074994, 37613.19663707921, 31321.72125252333, 30932.753428207085, 37772.776810934076, 45796.64013962434, 47820.36900857583, 45315.76501111126, 35448.39425262605, 29850.35013283804, 29591.159982436213, 37654.079017344375, 45495.78841466611, 47908.959672341305, 44361.65729402939, 35983.7996238946, 32075.219160365035, 33116.287276080366, 42322.98190708696, 51963.46941425464, 52271.57848952532, 52043.23599364105, 45266.807463334626, 40278.50721681852, 41783.0851672551, 49025.27275062121, 57753.777188678556, 57148.82659973791, 51158.111967429824, 40270.45350840617, 32304.431822795817, 32070.63494283356, 37752.91137050926, 42330.03164511995, 42399.727284674096, 36018.92734513604, 30062.163733450554, 29588.1255976938, 29092.231273784368, 26893.12179696699, 34929.949841169895, 42461.99046702016, 41600.38455975693, 34379.79505825023, 29348.680842359154, 30674.019184850084, 36438.884319303455, 42592.755250922644, 42370.3178916497, 42156.89423782389, 35502.775577422006, 32961.0702792199, 33247.123325205226, 32638.287197185437, 44674.20771450234, 51377.68957910347, 50811.51554085845, 40938.066497686414, 34693.61195056415, 35794.33867394901, 43608.768716936305, 53030.999471673196, 55375.512665613925, 51350.83787518498, 43515.67353483048, 36881.67381692623, 38068.73656246778, 46093.29078747657, 57268.0460688599, 58937.23137518557, 54242.503589531414, 45218.09204062047, 41196.31936112136, 42541.41388049561, 41024.48650669875, 51622.7076655413, 56013.155747602264, 52544.45187791683, 43851.12531567635, 38604.27886985339, 38620.02900914742, 47969.72289541592, 56852.747992644305, 61996.47126627607, 58805.310177720465, 47280.70456188186, 40583.34097614553, 41128.192126370894, 50193.55312608201, 59539.43834004759, 64983.864246308556, 61279.46570907009, 48338.44203029854, 40484.3484226722, 41884.39600353677, 54289.35228313394, 61250.691081910205, 64220.31437336595, 59165.983057045596, 47642.189399246454, 41115.51194818619, 41876.72414814975, 51464.90978190481, 62803.30863417166, 65385.57373290332, 60334.06981497588, 46284.58686223134, 38833.89407337377, 39698.05357331142, 47745.00067522588, 55101.85604010965, 58237.24426215552, 
57764.44135307596, 50420.8551702024, 45567.72968395969, 46504.94127248667, 54349.34998968483, 60731.4138465725, 63303.91498179767, 60298.72859863081, 51704.17508237814, 45717.347342456975, 46059.70474391559, 53049.488500429776, 59306.32398907892, 62403.524245874905, 59696.274176988816, 48684.76124917582, 41835.26632192022, 42430.54555033094, 50076.369117078095, 56940.56608534447, 58025.69842660899, 55038.43675221109, 44884.821827167245, 38939.787615900466, 38979.59802389587, 44663.55176765817, 50517.54060893932, 50345.85708898577, 47988.558150172845, 40853.4057042814, 38165.8415118995, 38648.876806740445, 41908.58157309806, 47401.57907842861, 49481.04013850385, 48095.25972992769, 41284.361139943714, 37854.4480306016, 37546.53779346516, 41603.843693113595, 47820.91079534329, 47246.476111540374, 44466.81067665845, 39450.05564794844, 37010.325669384845, 36654.85544497009, 40353.082727944166, 46185.06388406285, 47580.77496864459, 46665.410664712144, 40791.501556263196, 38355.37341411503, 37635.23471064154, 36758.08397128221, 45985.88943874199, 49995.80097644232, 47975.90718646997, 41429.81245724381, 37282.35082487397, 37137.63721104596, 42201.7547964592, 49822.27312845655, 54534.49980287231, 54533.77882552861, 47481.81951410114, 42055.08936215232, 43879.59319217621, 50025.5866728274, 57779.83356382646, 62229.99530852202, 60656.63284916041, 53322.422185201416, 46947.596440932735, 48157.11595496144, 55485.4195651247, 65417.97612369258, 69959.13505419965, 68457.41761185907, 58008.38472436904, 49036.129043870984, 48812.32078456241, 55770.81148868941, 62578.40993961545, 64911.32653171455, 63452.51661006169, 56211.79640881274, 49058.582679465755, 49695.71775694223, 54647.5259012805, 59928.6898365422, 54631.05724966697, 49250.29905728087, 40856.97377954754, 36266.89995539015, 35517.15307352361, 34184.53353052385, 33371.86128513576, 31969.554327793263, 30641.723523056826, 28778.69553058833, 27946.73958770959, 27509.81466494675, 26662.70133275635, 26476.786252324444, 24811.176812171496, 23734.407940954658, 21813.02876068876, 19691.25151218245, 17829.182468339233, 15717.002576485738, 14676.555091672217, 13659.528312206226, 12944.24059105674, 11821.33696211924, 10574.619947022518, 9870.736391613336, 8602.998647942173, 8088.018686973378, 7668.737915280734, 7409.212563574305, 7059.414255782104, 6805.311671542087, 6265.577428451043, 5336.939303601046, 5036.213630383211, 4759.937178041844, 5002.634961970787, 5377.02368302538, 5080.424215798721, 4864.875421529681, 4236.1512408381295, 4106.63390017699, 4042.2341857847173, 4197.684686131461, 4474.035273523109, 4490.267528124243, 4241.475197902689, 3589.2927029890307, 3444.298594012123, 3107.614908066812, 3426.8278986551327, 3828.278281872126, 3941.803702165574, 3732.7576911047563, 3225.113307797762, 3072.114284546333, 3150.7921285990747, 3233.947689115616, 3619.052387506853, 3551.5976360602217, 3435.0323606968946, 2973.892113333866, 2674.3168644471334, 2460.238549245212, 2800.034199553288, 3195.2623366970715, 3107.693557143192, 2961.140811696215, 2340.478044336084, 1931.373924738195, 1847.1236388756024, 2179.3253334294473, 2438.0353828364327, 2379.86512657921, 2255.5989701513035, 1870.926018202182, 1620.6083820631166, 1511.3110067191255, 1685.9676158651248, 2099.8497631541054, 2430.5076190841487, 2701.9755190700494, 2997.6300510919987, 2977.3472468469777, 2893.576703677185, 3207.4670223153535, 3539.6180497104797, 3534.997475538599, 3527.721146658791, 3489.9154298884955, 3287.543337245921]
print(performance_df)
I need to access last year's revenue and print it next to the current year in a new df column, so I can then compare year-on-year performance.
My desired output would look something like this:
Revenue . LY Revenue
Date
2018-01-01 25891.846787 . Nan
2018-01-02 25851.615541 . Nan
2018-01-03 25037.711900 . Nan
2018-01-04 26715.764965 . Nan
2018-01-05 23988.356950 . Nan
... ...
2019-12-27 3539.618050 . 25744.075480
2019-12-28 3534.997476 . 27119.697589
2019-12-29 3527.721147 . 28894.626077
2019-12-30 3489.915430 . 30321.364425
2019-12-31 3287.543337 . 29665.558703
How would you go about achieving this?
So far I have only been able to get last year's date from the index with:
performance_df['Last year dates'] = (performance_df['Revenue'].index - pd.Timedelta(days=365))
But I want the corresponding revenue for that date, not just the date itself.
Use Series.shift:
performance_df['LY Revenue']=performance_df['Revenue'].shift(365)
print(performance_df)
Revenue LY Revenue
Date
2018-01-01 25891.8% nan%
2018-01-02 25851.6% nan%
2018-01-03 25037.7% nan%
2018-01-04 26715.8% nan%
2018-01-05 23988.4% nan%
... ... ...
2019-12-27 3539.6% 25744.1%
2019-12-28 3535.0% 27119.7%
2019-12-29 3527.7% 28894.6%
2019-12-30 3489.9% 30321.4%
2019-12-31 3287.5% 29665.6%
[730 rows x 2 columns]
Here you can see the beginning of the year 2019:
print(performance_df[364:366])
Revenue LY Revenue
Date
2018-12-31 29665.6% nan%
2019-01-01 28601.7% 25891.8%
IIUC, you need this.
This works only when you have a datetime index.
What we are doing here is grouping by day and month of the datetime value, so even when the dates span a leap year and a normal year this should work.
performance_df['LY_Revenue'] = performance_df.groupby([performance_df.index.month,performance_df.index.day])['Revenue'].shift()
print(performance_df)
Output
Revenue LY_Revenue
Date
2018-01-01 25891.846787 NaN
2018-01-02 25851.615541 NaN
2018-01-03 25037.711900 NaN
2018-01-04 26715.764965 NaN
2018-01-05 23988.356950 NaN
2018-01-06 19029.057983 NaN
2018-01-07 16935.481705 NaN
2018-01-08 22756.072913 NaN
2018-01-09 30385.672829 NaN
2018-01-10 32970.132175 NaN
2018-01-11 31089.167075 NaN
2018-01-12 24262.972415 NaN
2018-01-13 18261.273832 NaN
2018-01-14 18304.754084 NaN
2018-01-15 26297.835665 NaN
2018-01-16 32619.669405 NaN
2018-01-17 35565.262225 NaN
2018-01-18 33229.971940 NaN
2018-01-19 25405.647136 NaN
2018-01-20 19980.890375 NaN
2018-01-21 20487.553161 NaN
2018-01-22 29709.032322 NaN
2018-01-23 38164.493648 NaN
2018-01-24 39050.801147 NaN
2018-01-25 36612.554433 NaN
2018-01-26 28169.782524 NaN
2018-01-27 22086.641618 NaN
2018-01-28 21631.662706 NaN
2018-01-29 28419.945290 NaN
2018-01-30 35644.617364 NaN
... ... ...
2019-12-02 2973.892113 28289.697207
2019-12-03 2674.316864 34737.317368
2019-12-04 2460.238549 40574.910348
2019-12-05 2800.034200 40556.066887
2019-12-06 3195.262337 39927.322507
2019-12-07 3107.693557 34634.748383
2019-12-08 2961.140812 27666.467364
2019-12-09 2340.478044 27774.363832
2019-12-10 1931.373925 33950.846875
2019-12-11 1847.123639 39518.061312
2019-12-12 2179.325333 39587.568701
2019-12-13 2438.035383 38832.660311
2019-12-14 2379.865127 32258.462222
2019-12-15 2255.598970 23343.008315
2019-12-16 1870.926018 23914.895775
2019-12-17 1620.608382 28173.094175
2019-12-18 1511.311007 30306.555827
2019-12-19 1685.967616 28284.310392
2019-12-20 2099.849763 24228.754426
2019-12-21 2430.507619 20495.999365
2019-12-22 2701.975519 19302.936445
2019-12-23 2997.630051 21391.090777
2019-12-24 2977.347247 21072.220129
2019-12-25 2893.576704 19770.681250
2019-12-26 3207.467022 22751.205447
2019-12-27 3539.618050 25744.075480
2019-12-28 3534.997476 27119.697589
2019-12-29 3527.721147 28894.626077
2019-12-30 3489.915430 30321.364425
2019-12-31 3287.543337 29665.558703
Since your data is time-indexed, you can shift with freq:
performance_df['LY Revenue'] = performance_df.Revenue.shift(freq='365d')
Output:
Revenue LY Revenue
Date
2018-01-01 25891.8% nan%
2018-01-02 25851.6% nan%
2018-01-03 25037.7% nan%
2018-01-04 26715.8% nan%
2018-01-05 23988.4% nan%
2018-01-06 19029.1% nan%
2018-01-07 16935.5% nan%
2018-01-08 22756.1% nan%
2018-01-09 30385.7% nan%
...
2019-12-21 2430.5% 20496.0%
2019-12-22 2702.0% 19302.9%
2019-12-23 2997.6% 21391.1%
2019-12-24 2977.3% 21072.2%
2019-12-25 2893.6% 19770.7%
2019-12-26 3207.5% 22751.2%
2019-12-27 3539.6% 25744.1%
2019-12-28 3535.0% 27119.7%
2019-12-29 3527.7% 28894.6%
2019-12-30 3489.9% 30321.4%
2019-12-31 3287.5% 29665.6%
But note that 365D is not necessarily a year in general.
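If the 365-day approximation bothers you, a calendar-exact variant (a sketch of the same idea) shifts the index by a one-year DateOffset and re-aligns; be careful around Feb 29, where the offset can create duplicate dates:
# Shift the index forward one calendar year, then align back onto the
# original index; dates without a counterpart simply become NaN.
ly = performance_df['Revenue'].shift(freq=pd.DateOffset(years=1))
performance_df['LY Revenue'] = ly.reindex(performance_df.index)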
Hard to confirm without the data, but my thought is to use .loc:
pdf = performance_df
pdf.loc['2018-xx-xx','Revenue'] - pdf.loc['2019-xx-xx','Revenue']
You may be able to simply use:
pdf.loc['2018','Revenue'] - pdf.loc['2019','Revenue']
Of course you can also simply slice the years into df's, and then subtract/divide
pdf18 = pdf.loc['2018-01-01':'2018-12-31', 'Revenue']
pdf19 = pdf.loc['2019-01-01':'2019-12-31', 'Revenue']
pdf_diff = pdf18 - pdf19
# relative change: (pdf19 - pdf18) / pdf18
# or slice just the year, then subtract: pdf18['Revenue'] - pdf19['Revenue']
This may have issues because the shape[0]'s won't match (not a full year) and the date indexes differ, so make sure they line up before subtracting.
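To make the two year-slices actually line up before subtracting (the two DatetimeIndexes never overlap, so a naive subtraction is all NaN), one sketch is to drop each slice's index down to the day of year:
pdf18 = pdf.loc['2018', 'Revenue']
pdf19 = pdf.loc['2019', 'Revenue']
pdf18.index = pdf18.index.dayofyear   # 1..365
pdf19.index = pdf19.index.dayofyear
yoy_diff = pdf19 - pdf18              # aligns day by day across the two years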

Data cleaning and preparation for Time-Series-LSTM

I need to prepare my Data to feed it into an LSTM for predicting the next day.
My Dataset is a time series in seconds but I have just 3-5 hours a day of Data. (I just have this specific Dataset so can't change it)
I have Date-Time and a certain Value.
E.g.:
datetime..............Value
2015-03-15 12:00:00...1000
2015-03-15 12:00:01....10
.
.
I would like to write code that extracts e.g. 4 hours per day and deletes the first extracted hour, but only for specific months (because that data is faulty).
I managed to write a code to extract e.g. 2 hours for x-Data (Input) and y-Data (Output).
I hope I could explain my problem to you.
The Dataset is 1 year of data in seconds, covering 6pm-11pm each day; the rest is missing.
In e.g. August-November the first hour is faulty data and needs to be deleted.
init = True
for day in np.unique(x_df.index.date):
    temp = x_df.loc[(day + pd.DateOffset(hours=18)):(day + pd.DateOffset(hours=20))]
    if len(temp) == 7201:
        if init:
            x_df1 = np.array([temp.values])
            init = False
        else:
            #print(temp.values.shape)
            x_df1 = np.append(x_df1, np.array([temp.values]), axis=0)
    #else:
    #    if not temp.empty:
    #        print(temp.index[0].date(), len(temp))
x_df1 = np.array(x_df1)
print('X-Shape:', x_df1.shape,
      'Y-Shape:', y_df1.shape)
#sample, timesteps and features for LSTM
X-Shape: (32, 7201, 6) Y-Shape: (32, 7201)
My expected result is a dataset of e.g. 4 hours a day where the first hour in e.g. August, September, and October is deleted.
I would also be very happy if someone could provide nicer code to do this.
Probably not the most efficient solution, but maybe it still fits.
First let's generate some random data for the first 4 months, with 5 days per month:
import datetime
import random
import pandas as pd

df = pd.DataFrame()
for month in range(1, 5):     # first 4 months
    for day in range(5, 10):  # 5 days each
        hour = random.randint(18, 19)
        minute = random.randint(1, 59)
        dt = datetime.datetime(2018, month, day, hour, minute, 0)
        dti = pd.date_range(dt, periods=60*60*4, freq='S')  # 4 hours of seconds
        values = [random.randrange(1, 101, 1) for _ in range(len(dti))]
        df = pd.concat([df, pd.DataFrame(values, index=dti, columns=['Value'])])  # df.append was removed in pandas 2
Now let's define a function to filter the first row per day:
def first_value_per_day(df):
    res_df = df.groupby(df.index.date).apply(lambda x: x.iloc[[0]])
    res_df.index = res_df.index.droplevel(0)
    return res_df
and print the results:
print(first_value_per_day(df))
Value
2018-01-05 18:31:00 85
2018-01-06 18:25:00 40
2018-01-07 19:54:00 52
2018-01-08 18:23:00 46
2018-01-09 18:08:00 51
2018-02-05 18:58:00 6
2018-02-06 19:12:00 16
2018-02-07 18:18:00 10
2018-02-08 18:32:00 50
2018-02-09 18:38:00 69
2018-03-05 19:54:00 100
2018-03-06 18:37:00 70
2018-03-07 18:58:00 26
2018-03-08 18:28:00 30
2018-03-09 18:34:00 71
2018-04-05 18:54:00 2
2018-04-06 19:16:00 100
2018-04-07 18:52:00 85
2018-04-08 19:08:00 66
2018-04-09 18:11:00 22
Now we need a list of the specific months that should be processed, in this case 2 and 3. We use the function defined above, filter its result for each selected month, and loop over those rows to find the indexes of all values from the first entry of each day up to one hour later, and drop them:
MONTHS_TO_MODIFY = [2, 3]
HOURS_TO_DROP = 1
fvpd = first_value_per_day(df)
for m in MONTHS_TO_MODIFY:
    fvpdm = fvpd[fvpd.index.month == m]
    for idx, value in fvpdm.iterrows():
        start_dt = idx
        end_dt = idx + datetime.timedelta(hours=HOURS_TO_DROP)
        index_list = df[(df.index >= start_dt) & (df.index < end_dt)].index.tolist()
        df.drop(index_list, inplace=True)
result:
print(first_value_per_day(df))
Value
2018-01-05 18:31:00 85
2018-01-06 18:25:00 40
2018-01-07 19:54:00 52
2018-01-08 18:23:00 46
2018-01-09 18:08:00 51
2018-02-05 19:58:00 1
2018-02-06 20:12:00 42
2018-02-07 19:18:00 34
2018-02-08 19:32:00 34
2018-02-09 19:38:00 61
2018-03-05 20:54:00 15
2018-03-06 19:37:00 88
2018-03-07 19:58:00 36
2018-03-08 19:28:00 38
2018-03-09 19:34:00 42
2018-04-05 18:54:00 2
2018-04-06 19:16:00 100
2018-04-07 18:52:00 85
2018-04-08 19:08:00 66
2018-04-09 18:11:00 22
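A more compact, loop-free variant (a sketch under the same assumptions: a seconds-indexed df and months 2 and 3 to be trimmed): compute each day's first timestamp with a grouped transform and mask away everything inside the first hour:
import pandas as pd
MONTHS_TO_MODIFY = [2, 3]
ts = df.index.to_series()
first = ts.groupby(ts.dt.date).transform('min')      # first timestamp per day
in_first_hour = ts < first + pd.Timedelta(hours=1)   # rows inside that hour
in_months = ts.dt.month.isin(MONTHS_TO_MODIFY)
df = df[~(in_first_hour & in_months)]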

pandas_datareader.data not returning all stock values from start to end date

I am trying to get stock data from Yahoo using pandas_datareader.data and I keep getting missing sections of data. Here is what I have coded. All I want to do right now is return all the data for the dates between the start and end dates:
import pandas as pd
import pandas_datareader.data as web
from datetime import datetime
ibm = web.DataReader('IBM', 'yahoo', datetime(2015,1,1),
                     datetime(2016,1,1))
and right now this is returning this:
[screenshot: the DataFrame prints with an ellipsis where rows are hidden]
I am confused why I am getting the ellipsis with all the missing data when I try to create my set. Any help would be greatly appreciated!
This is how pandas displays the result (as explained here). pandas omits rows that exceed the pd.set_option('display.max_rows', X) setting (the default is 60). You can see the current limit via pd.options.display.max_rows.
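If you do want every row printed, the display limit can be raised or disabled (None means no limit):
pd.set_option('display.max_rows', None)  # or any number large enough
print(ibm)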
Try ibm.info() and you should see that there are more rows than displayed.
Your query results in:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 252 entries, 2015-01-02 to 2015-12-31
Data columns (total 6 columns):
Open 252 non-null float64
High 252 non-null float64
Low 252 non-null float64
Close 252 non-null float64
Volume 252 non-null int64
Adj Close 252 non-null float64
dtypes: float64(5), int64(1)
memory usage: 13.8 KB
None
But it displays as follows (note the rows x columns info at the bottom despite the ellipsis):
Open High Low Close Volume \
Date
2015-01-02 161.309998 163.309998 161.000000 162.059998 5525500
2015-01-05 161.270004 161.270004 159.190002 159.509995 4880400
2015-01-06 159.669998 159.960007 155.169998 156.070007 6146700
2015-01-07 157.199997 157.199997 154.029999 155.050003 4701800
2015-01-08 156.240005 159.039993 155.550003 158.419998 4236800
2015-01-09 158.419998 160.339996 157.250000 159.110001 4484800
2015-01-12 159.000000 159.250000 155.759995 156.440002 4182800
2015-01-13 157.259995 159.970001 155.679993 156.809998 4377500
2015-01-14 154.860001 156.490005 153.740005 155.800003 4690300
2015-01-15 156.690002 156.970001 154.160004 154.570007 4248400
2015-01-16 153.820007 157.630005 153.820007 157.139999 5756000
2015-01-20 156.699997 157.330002 154.029999 156.949997 8392800
2015-01-21 153.029999 154.500000 151.070007 152.089996 11897100
2015-01-22 151.940002 155.720001 151.759995 155.389999 6120100
2015-01-23 155.029999 157.600006 154.889999 155.869995 4834800
2015-01-26 158.259995 159.460007 155.770004 156.360001 7888100
2015-01-27 154.940002 155.089996 152.589996 153.669998 5659600
2015-01-28 154.000000 154.529999 151.550003 151.550003 4495900
2015-01-29 151.380005 155.580002 149.520004 155.479996 8320800
2015-01-30 153.910004 155.240005 153.039993 153.309998 6563600
2015-02-02 154.000000 154.660004 151.509995 154.660004 4712200
2015-02-03 154.750000 158.600006 154.750000 158.470001 5539400
2015-02-04 157.210007 158.710007 156.699997 156.960007 3678500
2015-02-05 157.289993 158.589996 157.149994 157.910004 5253600
2015-02-06 157.339996 158.080002 156.229996 156.720001 3225000
2015-02-09 156.000000 157.500000 155.399994 155.750000 3057700
2015-02-10 156.740005 158.559998 155.080002 158.559998 4440600
2015-02-11 157.759995 159.089996 157.169998 158.199997 3626700
2015-02-12 158.720001 159.500000 158.089996 158.520004 3333100
2015-02-13 158.779999 160.800003 158.639999 160.399994 3706900
... ... ... ... ... ...
2015-11-18 134.789993 135.910004 134.259995 135.820007 4149200
2015-11-19 136.210007 137.740005 136.009995 136.740005 4753600
2015-11-20 137.369995 138.919998 137.250000 138.500000 5176400
2015-11-23 138.529999 138.869995 137.119995 138.460007 5137900
2015-11-24 137.649994 139.339996 137.309998 138.600006 3407700
2015-11-25 138.369995 138.429993 137.380005 138.000000 3238200
2015-11-27 138.000000 138.809998 137.210007 138.460007 1415800
2015-11-30 138.610001 139.899994 138.520004 139.419998 4545600
2015-12-01 139.580002 141.399994 139.580002 141.279999 4195100
2015-12-02 140.929993 141.210007 139.500000 139.699997 3725400
2015-12-03 140.100006 140.729996 138.190002 138.919998 5909600
2015-12-04 138.089996 141.020004 137.990005 140.429993 4571600
2015-12-07 140.160004 140.410004 138.809998 139.550003 3279400
2015-12-08 138.279999 139.059998 137.529999 138.050003 3905200
2015-12-09 137.380005 139.839996 136.229996 136.610001 4615000
2015-12-10 137.029999 137.850006 135.720001 136.779999 4222300
2015-12-11 135.229996 135.440002 133.910004 134.570007 5333800
2015-12-14 135.309998 136.139999 134.020004 135.929993 5143800
2015-12-15 137.399994 138.970001 137.279999 137.789993 4207900
2015-12-16 139.119995 139.649994 137.789993 139.289993 4345500
2015-12-17 139.350006 139.500000 136.309998 136.750000 4089500
2015-12-18 136.410004 136.960007 134.270004 134.899994 10026100
2015-12-21 135.830002 135.830002 134.020004 135.500000 5617500
2015-12-22 135.880005 138.190002 135.649994 137.929993 4263800
2015-12-23 138.300003 139.309998 138.110001 138.539993 5164900
2015-12-24 138.429993 138.880005 138.110001 138.250000 1495200
2015-12-28 137.740005 138.039993 136.539993 137.610001 3143400
2015-12-29 138.250000 140.059998 138.199997 139.779999 3943700
2015-12-30 139.580002 140.440002 139.220001 139.339996 2989400
2015-12-31 139.070007 139.100006 137.570007 137.619995 3462100
Adj Close
Date
2015-01-02 153.863588
2015-01-05 151.442555
2015-01-06 148.176550
2015-01-07 147.208134
2015-01-08 150.407687
2015-01-09 151.062791
2015-01-12 148.527832
2015-01-13 148.879114
2015-01-14 147.920202
2015-01-15 146.752415
2015-01-16 149.192426
2015-01-20 149.012033
2015-01-21 144.397834
2015-01-22 147.530934
2015-01-23 147.986654
2015-01-26 148.451876
2015-01-27 145.897925
2015-01-28 143.885151
2015-01-29 147.616379
2015-01-30 145.556132
2015-02-02 146.837859
2015-02-03 150.455161
2015-02-04 149.021536
2015-02-05 149.923486
2015-02-06 149.837432
2015-02-09 148.910029
2015-02-10 151.596622
2015-02-11 151.252431
2015-02-12 151.558385
2015-02-13 153.355812
... ...
2015-11-18 133.161622
2015-11-19 134.063613
2015-11-20 135.789160
2015-11-23 135.749949
2015-11-24 135.887208
2015-11-25 135.298946
2015-11-27 135.749949
2015-11-30 136.691151
2015-12-01 138.514746
2015-12-02 136.965669
2015-12-03 136.200937
2015-12-04 137.681377
2015-12-07 136.818611
2015-12-08 135.347970
2015-12-09 133.936153
2015-12-10 134.102824
2015-12-11 131.936088
2015-12-14 133.269455
2015-12-15 135.093050
2015-12-16 136.563691
2015-12-17 134.073412
2015-12-18 132.259616
2015-12-21 132.847878
2015-12-22 135.230309
2015-12-23 135.828370
2015-12-24 135.544053
2015-12-28 134.916580
2015-12-29 137.044105
2015-12-30 136.612715
2015-12-31 134.926379
[252 rows x 6 columns]

Time Delta of Year Excluding Certain Days

I am making a heat map that has Company Name on the x axis, months on the y-axis, and shaded regions as the number of calls.
I am taking a slice of data from a database for the past year in order to create the heat map. However, this means that if you hover over the current month, say for example today is July 13, you will get the calls of July 1-13 of this year, and the calls of July 13-31 from last year added together. In the current month, I only want to show calls from July 1-13.
#This section selects the last year of data
# convert strings to datetimes
df['recvd_dttm'] = pd.to_datetime(df['recvd_dttm'])
#Only retrieve data before now (ignore typos that are future dates)
mask = df['recvd_dttm'] <= datetime.datetime.now()
df = df.loc[mask]
# get first and last datetime for the final year of data
range_max = df['recvd_dttm'].max()
range_min = range_max - datetime.timedelta(days=365)
# take slice with the final year of data
df = df[(df['recvd_dttm'] >= range_min) &
        (df['recvd_dttm'] <= range_max)]
You can use the pd.tseries.offsets.MonthEnd() to achieve your goal here.
import pandas as pd
import numpy as np
import datetime as dt
np.random.seed(0)
val = np.random.randn(600)
date_rng = pd.date_range('2014-01-01', periods=600, freq='D')
df = pd.DataFrame(dict(dates=date_rng,col=val))
print(df)
col dates
0 1.7641 2014-01-01
1 0.4002 2014-01-02
2 0.9787 2014-01-03
3 2.2409 2014-01-04
4 1.8676 2014-01-05
5 -0.9773 2014-01-06
6 0.9501 2014-01-07
7 -0.1514 2014-01-08
8 -0.1032 2014-01-09
9 0.4106 2014-01-10
.. ... ...
590 0.5433 2015-08-14
591 0.4390 2015-08-15
592 -0.2195 2015-08-16
593 -1.0840 2015-08-17
594 0.3518 2015-08-18
595 0.3792 2015-08-19
596 -0.4700 2015-08-20
597 -0.2167 2015-08-21
598 -0.9302 2015-08-22
599 -0.1786 2015-08-23
[600 rows x 2 columns]
print(df.dates.dtype)
datetime64[ns]
datetime_now = dt.datetime.now()
datetime_now_month_end = datetime_now + pd.tseries.offsets.MonthEnd(1)
print(datetime_now_month_end)
2015-07-31 03:19:18.292739
datetime_start = datetime_now_month_end - pd.tseries.offsets.DateOffset(years=1)
print(datetime_start)
2014-07-31 03:19:18.292739
print(df[(df.dates > datetime_start) & (df.dates < datetime_now)])
col dates
212 0.7863 2014-08-01
213 -0.4664 2014-08-02
214 -0.9444 2014-08-03
215 -0.4100 2014-08-04
216 -0.0170 2014-08-05
217 0.3792 2014-08-06
218 2.2593 2014-08-07
219 -0.0423 2014-08-08
220 -0.9559 2014-08-09
221 -0.3460 2014-08-10
.. ... ...
550 0.1639 2015-07-05
551 0.0963 2015-07-06
552 0.9425 2015-07-07
553 -0.2676 2015-07-08
554 -0.6780 2015-07-09
555 1.2978 2015-07-10
556 -2.3642 2015-07-11
557 0.0203 2015-07-12
558 -1.3479 2015-07-13
559 -0.7616 2015-07-14
[348 rows x 2 columns]
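Applied to your frame, the same offsets give the slice you described, so the back half of the current month comes only from this year (a sketch reusing your recvd_dttm column):
now = datetime.datetime.now()
# end of the current month, one year ago -> start of the slice
range_min = (now + pd.tseries.offsets.MonthEnd(1)) - pd.tseries.offsets.DateOffset(years=1)
df = df[(df['recvd_dttm'] > range_min) & (df['recvd_dttm'] <= now)]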
