Add Future dates for missing Rows in a Dataframe - python

How can I impute the missing dates with the next dates in a data frame?
wtg_at1.tail(10)
     AmbientTemperatue        Date
818          31.237499  2020-03-28
819          32.865974  2020-03-29
820          32.032558  2020-03-30
821          31.671166         NaN
822          31.389927         NaN
823          31.243660         NaN
824          31.206777         NaN
825          31.241503         NaN
826          31.309531         NaN
827          31.382531         NaN
I expect my output data frame to look similar to the one below: after 30th March, the dates should continue from 31st March.
     AmbientTemperatue        Date
818          31.237499  2020-03-28
819          32.865974  2020-03-29
820          32.032558  2020-03-30
821          31.671166  2020-03-31
822          31.389927  2020-04-01
823          31.243660  2020-04-02
824          31.206777  2020-04-03
825          31.241503  2020-04-04
826          31.309531  2020-04-05
827          31.382531  2020-04-06
I tried the code below, but it does not give the desired output.
wtg_at1.append(pd.DataFrame({'Date': pd.date_range(start=wtg_at1.Date.iloc[-8], periods=7, freq='D', closed='right')}))
wtg_at1
     AmbientTemperatue        Date
0            32.032558  2017-12-31
1            26.667757  2018-01-01
2            25.655754  2018-01-02
3            25.514013  2018-01-03
4            24.927652  2018-01-04
..                 ...         ...
823          31.243660         NaN
824          31.206777         NaN
825          31.241503         NaN
826          31.309531         NaN
827          31.382531         NaN

If there is only one group of missing values, it is possible to forward fill the dates and add a counter, built from a cumulative sum of the missing mask converted to day timedeltas:
df['Date'] = pd.to_datetime(df['Date'])
# forward fill the last known date and add 1 day per consecutive missing value
df['Date'] = df['Date'].ffill() + pd.to_timedelta(df['Date'].isna().cumsum(), unit='d')
print(df)
AmbientTemperatue Date
818 31.237499 2020-03-28
819 32.865974 2020-03-29
820 32.032558 2020-03-30
821 31.671166 2020-03-31
822 31.389927 2020-04-01
823 31.243660 2020-04-02
824 31.206777 2020-04-03
825 31.241503 2020-04-04
826 31.309531 2020-04-05
827 31.382531 2020-04-06
Another possible idea is to reassign the values from the minimal datetime and the length of the DataFrame (this assumes the dates should form one consecutive daily sequence):
df['Date'] = pd.date_range(df['Date'].min(), periods=len(df))
If there are multiple groups of missing values:
print (df)
AmbientTemperatue Date
818 31.237499 2020-03-28
819 32.865974 2020-03-29
820 32.032558 2020-03-30
821 31.671166 NaN
822 31.389927 NaN
823 31.243660 NaN
824 31.206777 2020-05-08
825 31.241503 NaN
826 31.309531 NaN
827 31.382531 NaN
df['Date'] = pd.to_datetime(df['Date'])
m = df['Date'].notna()
# counter of consecutive missing values that restarts after each valid date
s = (~m).groupby(m.cumsum()).cumsum()
df['Date'] = df['Date'].ffill() + pd.to_timedelta(s, unit='d')
print(df)
AmbientTemperatue Date
818 31.237499 2020-03-28
819 32.865974 2020-03-29
820 32.032558 2020-03-30
821 31.671166 2020-03-31
822 31.389927 2020-04-01
823 31.243660 2020-04-02
824 31.206777 2020-05-08
825 31.241503 2020-05-09
826 31.309531 2020-05-10
827 31.382531 2020-05-11

Related

pandas.to_datetime not converting all rows to datetime

simple transformation to convert a string date time to datetime in a df not working - please see last column 990 onwards
new_df = pd.melt(
    frame=df,
    id_vars={'Date', 'Day'}
)
new_df['new_date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y', errors='raise')
Date Day variable value new_date
0 1/5/2015 289 Cases_Guinea 2776.0 2015-01-05
1 1/4/2015 288 Cases_Guinea 2775.0 2015-01-04
2 1/3/2015 287 Cases_Guinea 2769.0 2015-01-03
3 1/2/2015 286 Cases_Guinea NaN 2015-01-02
4 12/31/2014 284 Cases_Guinea 2730.0 2014-12-31
5 12/28/2014 281 Cases_Guinea 2706.0 2014-12-28
6 12/27/2014 280 Cases_Guinea 2695.0 2014-12-27
7 12/24/2014 277 Cases_Guinea 2630.0 2014-12-24
8 12/21/2014 273 Cases_Guinea 2597.0 2014-12-21
9 12/20/2014 272 Cases_Guinea 2571.0 2014-12-20
.. ... ... ... ... ...
990 12/3/2014 256 Deaths_Guinea NaN NaT
991 11/30/2014 253 Deaths_Guinea 1327.0 NaT
992 11/28/2014 251 Deaths_Guinea NaN NaT
993 11/23/2014 246 Deaths_Guinea 1260.0 NaT
994 11/22/2014 245 Deaths_Guinea NaN NaT
995 11/18/2014 241 Deaths_Guinea 1214.0 NaT
996 11/16/2014 239 Deaths_Guinea 1192.0 NaT
997 11/15/2014 238 Deaths_Guinea NaN NaT
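No answer is shown here, but a likely cause (an assumption based on the snippet alone) is that the conversion reads from the original df rather than from new_df: pandas aligns the assignment on the index, so every melted row beyond the length of df receives NaT. A minimal sketch of the fix:
# read the dates from the melted frame itself, so the index aligns row-for-row
new_df['new_date'] = pd.to_datetime(new_df['Date'], format='%m/%d/%Y', errors='raise')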

Python: calculate rolling returns over different frequencies

I have the following DataFrame:
data = [[99330,12,122],[1123,1230,1287],[123,101,812739],[1143,1230123,252],
        [234,342,4546],[2445,3453,3457],[7897,8657,5675],[46,5675,453],
        [76,484,3735],[363,93,4568],[385,568,367],[458,846,4847],
        [574,45747,658468],[57457,46534,4675]]
df1 = pd.DataFrame(data, index=['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04',
                                '2022-01-05', '2022-01-06', '2022-01-07', '2022-01-08',
                                '2022-01-09', '2022-01-10', '2022-01-11', '2022-01-12',
                                '2022-01-13', '2022-01-14'],
                   columns=['col_A', 'col_B', 'col_C'])
df1.index = pd.to_datetime(df1.index)
df1:
col_A col_B col_C
2022-01-01 99330 12 122
2022-01-02 1123 1230 1287
2022-01-03 123 101 812739
2022-01-04 1143 1230123 252
2022-01-05 234 342 4546
2022-01-06 2445 3453 3457
2022-01-07 7897 8657 5675
2022-01-08 46 5675 453
2022-01-09 76 484 3735
2022-01-10 363 93 4568
2022-01-11 385 568 367
2022-01-12 458 846 4847
2022-01-13 574 45747 658468
2022-01-14 57457 46534 4675
I am applying the following to get rolling returns:
periodicity_dict = {1: 'daily', 7: 'weekly'}
df_columns = df1.columns
for key in periodicity_dict:
    for col in df_columns:
        df1[col+'_rolling'] = np.nan
        for i in range(key, len(df1[col][df1[col].first_valid_index():df1[col].last_valid_index()])):
            df1[col+'_rolling'].iloc[i] = (df1[col].iloc[i] - df[col].iloc[i-key])/df[col].iloc[i-key]
But I am getting the following error: KeyError: 'col_A'.
What am I doing wrong? And is there a better way to do this with fewer loops?
I think you are looking for something like the shift method (no for-loop is needed):
df1['col_A_rolling'] = (df1['col_A'] - df1['col_A'].shift(7)) / df1['col_A'].shift(7)
OUTPUT:
col_A col_B col_C col_A_rolling
2022-01-01 99330 12 122 NaN
2022-01-02 1123 1230 1287 NaN
2022-01-03 123 101 812739 NaN
2022-01-04 1143 1230123 252 NaN
2022-01-05 234 342 4546 NaN
2022-01-06 2445 3453 3457 NaN
2022-01-07 7897 8657 5675 NaN
2022-01-08 46 5675 453 -0.999537
2022-01-09 76 484 3735 -0.932324
2022-01-10 363 93 4568 1.951220
2022-01-11 385 568 367 -0.663167
2022-01-12 458 846 4847 0.957265
2022-01-13 574 45747 658468 -0.765235
2022-01-14 57457 46534 4675 6.275801
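If you need this for every column and for both horizons in periodicity_dict, pct_change(periods=n) computes exactly (x_t - x_{t-n}) / x_{t-n}, so no row loop is required. A short sketch (assuming the same df1 as above; the result column names are illustrative):
for periods, label in {1: 'daily', 7: 'weekly'}.items():
    for col in ['col_A', 'col_B', 'col_C']:
        # n-period simple return, equivalent to the shift formula above
        df1[f'{col}_{label}_return'] = df1[col].pct_change(periods=periods)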

Drop rows with column value X and time difference less than Y

I checked this post but couldn't get to my solution.
I have a dataframe which I filtered with df[df.columntype == 'B'] to get the rows below. Also, df.timeframe is of type datetime64[ns]:
timeframe columntype
292 2021-05-19 10:17:00 B
293 2021-05-19 10:18:00 B
294 2021-05-19 10:18:00 B
295 2021-05-19 10:18:00 B
296 2021-05-19 10:18:00 B
418 2021-05-25 09:49:00 B
419 2021-05-25 09:49:00 B
420 2021-05-25 09:50:00 B
659 2021-07-08 10:33:00 B
660 2021-07-08 10:33:00 B
661 2021-07-08 10:33:00 B
I want to drop rows whose time difference to the previous row is less than 5 minutes. So I would get:
timeframe columntype
292 2021-05-19 10:17:00 B
418 2021-05-25 09:49:00 B
659 2021-07-08 10:33:00 B
How can I do this?
I would try to do it with the diff method:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.diff.html
Then filter on a condition where df.timeframe.diff().abs() < x to find the rows to drop.
Hope it helps
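A minimal sketch of that idea (an assumption here: the first row of each burst is kept, and a row is dropped when it falls within 5 minutes of the previous row):
gap = df['timeframe'].diff()
# keep the very first row (its diff is NaT) and any row at least 5 minutes after the previous one
filtered = df[gap.isna() | (gap >= pd.Timedelta(minutes=5))]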

Why is Pandas resample sampling out of sample?

I've got an issue with the pandas resample function when trying to resample a time series. My program fetches daily traffic data going back two years from today and writes it to a .csv file. Resampling the data initially worked well, but recently it has started acting up: when I try to resample the daily data into weekly, monthly or quarterly frequency, pandas seems to produce out-of-sample (non-existent) data on both sides of the actual range.
I first create a Pandas data frame from the csv file:
data = pd.read_csv('Trucks.csv')
data['Date'] = pd.to_datetime(data['Date'], infer_datetime_format=True)
data.set_index('Date',inplace=True)
data['Modified Total Trucks'] = data['Modified Total Trucks'].astype(int)
Here's a sample of the data:
Date Total Trucks Modified Total Trucks Solo Trucks Semi Trucks Full Trucks
2020-07-04 3898 2535 805 2281 812
2020-06-04 4125 2740 927 2378 820
2020-05-04 730 569 234 431 65
2020-04-04 465 354 145 270 50
2020-03-04 3501 2377 812 2051 638
2020-02-04 3594 2334 754 2081 759
...
2018-04-13 3243 2333 819 1978 446
2018-12-04 3402 2394 767 2144 491
2018-11-04 3559 2543 859 2209 491
2018-10-04 3492 2473 813 2182 497
2018-09-04 3733 2672 902 2321 510
I then try to resample the data:
DataWeekly = data.resample('1W').sum()
DataMonthly = data.resample('1M').sum()
DataQuarterly = data.resample('1Q').sum()
However, the resampled data frames have the wrong range and sometimes incorrect values. Here's an example of the monthly set:
Date Total Trucks Modified Total Trucks Solo Trucks Semi Trucks Full Trucks
2018-01-31 15553 11119 3842 9531 2180
2018-02-28 18488 13113 4497 11291 2700
2018-03-31 21355 15177 5134 13176 3045
2018-04-30 67785 48478 16524 41893 9368
2018-05-31 72390 51690 17666 44594 10130
2018-06-30 63877 45356 14938 40000 8939
2018-07-31 64846 46437 16108 39703 9035
2018-08-31 68352 49036 16905 42081 9366
2018-09-30 64629 46379 15963 39842 8824
2018-10-31 68093 48609 16806 41643 9644
2018-11-30 74643 53052 18581 45073 10989
2018-12-31 60270 43042 15030 36649 8591
2019-01-31 76866 55463 18994 47789 10083
2019-02-28 74705 53744 18170 46674 9861
2019-03-31 78664 56562 19108 49144 10412
2019-04-30 77760 56175 19356 48224 10180
2019-05-31 88033 63219 22049 53859 12125
2019-06-30 70370 50626 17448 43454 9468
2019-07-31 76014 54531 18698 46947 10369
2019-08-31 83509 60418 21600 50653 11256
2019-09-30 77289 55375 19097 47517 10675
2019-10-31 83514 60021 20761 51397 11356
2019-11-30 81383 58460 20550 49551 11282
2019-12-31 68307 49172 17092 41990 9225
2020-01-31 59448 42384 14547 36472 8429
2020-02-29 53862 38544 13687 32457 7718
2020-03-31 62950 43478 14930 37403 10617
2020-04-30 7796 5645 1968 4811 1017
2020-05-31 7983 5840 2053 4951 979
2020-06-30 11200 7918 2785 6710 1705
2020-07-31 10998 7673 2576 6691 1731
2020-08-31 4602 3323 1155 2838 609
2020-09-30 7980 5794 1991 4981 1008
2020-10-31 9759 7060 2464 6012 1283
2020-11-30 7762 5595 1906 4836 1020
2020-12-31 7642 5412 1790 4760 1092
I would expect the resample to be:
2018-04-30 67785 48478 16524 41893 9368
2018-05-31 72390 51690 17666 44594 10130
2018-06-30 63877 45356 14938 40000 8939
2018-07-31 64846 46437 16108 39703 9035
2018-08-31 68352 49036 16905 42081 9366
2018-09-30 64629 46379 15963 39842 8824
2018-10-31 68093 48609 16806 41643 9644
2018-11-30 74643 53052 18581 45073 10989
2018-12-31 60270 43042 15030 36649 8591
2019-01-31 76866 55463 18994 47789 10083
2019-02-28 74705 53744 18170 46674 9861
2019-03-31 78664 56562 19108 49144 10412
2019-04-30 77760 56175 19356 48224 10180
2019-05-31 88033 63219 22049 53859 12125
2019-06-30 70370 50626 17448 43454 9468
2019-07-31 76014 54531 18698 46947 10369
2019-08-31 83509 60418 21600 50653 11256
2019-09-30 77289 55375 19097 47517 10675
2019-10-31 83514 60021 20761 51397 11356
2019-11-30 81383 58460 20550 49551 11282
2019-12-31 68307 49172 17092 41990 9225
2020-01-31 59448 42384 14547 36472 8429
2020-02-29 53862 38544 13687 32457 7718
2020-03-31 62950 43478 14930 37403 10617
2020-04-30 7796 5645 1968 4811 1017
What am I missing? Many thanks in advance!
I think this is a problem with US vs ISO (European) date format, i.e. YYYY-DD-MM vs YYYY-MM-DD: it looks like pandas reads 2018-01-04 as the 4th of January and puts it into the 2018-01-31 bucket (i.e. January 2018).
You want to set the option dayfirst=True in your pd.to_datetime call; see the pandas docs for more details.
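A sketch of that change (an assumption: the csv really stores the day before the month):
data['Date'] = pd.to_datetime(data['Date'], dayfirst=True)
# or, if the exact source format is known, be explicit about it:
# data['Date'] = pd.to_datetime(data['Date'], format='%Y-%d-%m')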

Python - Statsmodels.tsa.seasonal_decompose - missing values in head and tail of dataframe

I have the following dataframe, which I'm calling "sales_df":
Value
Date
2004-01-01 0
2004-02-01 173
2004-03-01 225
2004-04-01 230
2004-05-01 349
2004-06-01 258
2004-07-01 270
2004-08-01 223
... ...
2015-06-01 218
2015-07-01 215
2015-08-01 233
2015-09-01 258
2015-10-01 252
2015-11-01 256
2015-12-01 188
2016-01-01 70
I want to separate its trend from its seasonal component and for that I use statsmodels.tsa.seasonal_decompose through the following code:
import statsmodels.api as sm

decomp = sm.tsa.seasonal_decompose(sales_df.Value)
df = pd.concat([sales_df, decomp.trend], axis=1)
df.columns = ['sales', 'trend']
This gives me the following:
sales trend
Date
2004-01-01 0 NaN
2004-02-01 173 NaN
2004-03-01 225 NaN
2004-04-01 230 NaN
2004-05-01 349 NaN
2004-06-01 258 NaN
2004-07-01 270 236.708333
2004-08-01 223 248.208333
2004-09-01 243 251.250000
... ... ...
2015-05-01 270 214.416667
2015-06-01 218 215.583333
2015-07-01 215 212.791667
2015-08-01 233 NaN
2015-09-01 258 NaN
2015-10-01 252 NaN
2015-11-01 256 NaN
2015-12-01 188 NaN
2016-01-01 70 NaN
Note that there are 6 NaNs at the start and at the end of the trend series.
So I ask: is that right? Why is it happening?
This is expected, as seasonal_decompose uses a symmetric (centered) moving average by default when the filt argument is not specified (as in your case), so half a window of trend values is undefined at each end of the series. The frequency is inferred from the time series.
https://searchcode.com/codesearch/view/86129185/
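If those edge NaNs are a problem, statsmodels also accepts an extrapolate_trend argument that fills both ends of the trend by linear extrapolation; a minimal sketch:
decomp = sm.tsa.seasonal_decompose(sales_df.Value, extrapolate_trend='freq')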
