Why is Pandas resample sampling out of sample?

I've got an issue with the pandas resample function when trying to resample a time series. My program fetches daily traffic data going back two years from today and writes it to a .csv file. Resampling initially worked well, but recently it has started acting up: when I resample the daily data to weekly, monthly, or quarterly frequency, pandas seems to randomly produce out-of-sample (non-existent) data on both sides of the actual range.
I first create a Pandas data frame from the csv file:
data = pd.read_csv('Trucks.csv')
data['Date'] = pd.to_datetime(data['Date'], infer_datetime_format=True)
data.set_index('Date',inplace=True)
data['Modified Total Trucks'] = data['Modified Total Trucks'].astype(int)
Here's a sample of the data:
Date Total Trucks Modified Total Trucks Solo Trucks Semi Trucks Full Trucks
2020-07-04 3898 2535 805 2281 812
2020-06-04 4125 2740 927 2378 820
2020-05-04 730 569 234 431 65
2020-04-04 465 354 145 270 50
2020-03-04 3501 2377 812 2051 638
2020-02-04 3594 2334 754 2081 759
...
2018-04-13 3243 2333 819 1978 446
2018-12-04 3402 2394 767 2144 491
2018-11-04 3559 2543 859 2209 491
2018-10-04 3492 2473 813 2182 497
2018-09-04 3733 2672 902 2321 510
I then try to resample the data:
DataWeekly = data.resample('1W').sum()
DataMonthly = data.resample('1M').sum()
DataQuarterly = data.resample('1Q').sum()
However, the resampled data frames have the wrong range and sometimes incorrect values. Here's an example of the monthly set:
Date Total Trucks Modified Total Trucks Solo Trucks Semi Trucks Full Trucks
2018-01-31 15553 11119 3842 9531 2180
2018-02-28 18488 13113 4497 11291 2700
2018-03-31 21355 15177 5134 13176 3045
2018-04-30 67785 48478 16524 41893 9368
2018-05-31 72390 51690 17666 44594 10130
2018-06-30 63877 45356 14938 40000 8939
2018-07-31 64846 46437 16108 39703 9035
2018-08-31 68352 49036 16905 42081 9366
2018-09-30 64629 46379 15963 39842 8824
2018-10-31 68093 48609 16806 41643 9644
2018-11-30 74643 53052 18581 45073 10989
2018-12-31 60270 43042 15030 36649 8591
2019-01-31 76866 55463 18994 47789 10083
2019-02-28 74705 53744 18170 46674 9861
2019-03-31 78664 56562 19108 49144 10412
2019-04-30 77760 56175 19356 48224 10180
2019-05-31 88033 63219 22049 53859 12125
2019-06-30 70370 50626 17448 43454 9468
2019-07-31 76014 54531 18698 46947 10369
2019-08-31 83509 60418 21600 50653 11256
2019-09-30 77289 55375 19097 47517 10675
2019-10-31 83514 60021 20761 51397 11356
2019-11-30 81383 58460 20550 49551 11282
2019-12-31 68307 49172 17092 41990 9225
2020-01-31 59448 42384 14547 36472 8429
2020-02-29 53862 38544 13687 32457 7718
2020-03-31 62950 43478 14930 37403 10617
2020-04-30 7796 5645 1968 4811 1017
2020-05-31 7983 5840 2053 4951 979
2020-06-30 11200 7918 2785 6710 1705
2020-07-31 10998 7673 2576 6691 1731
2020-08-31 4602 3323 1155 2838 609
2020-09-30 7980 5794 1991 4981 1008
2020-10-31 9759 7060 2464 6012 1283
2020-11-30 7762 5595 1906 4836 1020
2020-12-31 7642 5412 1790 4760 1092
I would expect the resample to be:
2018-04-30 67785 48478 16524 41893 9368
2018-05-31 72390 51690 17666 44594 10130
2018-06-30 63877 45356 14938 40000 8939
2018-07-31 64846 46437 16108 39703 9035
2018-08-31 68352 49036 16905 42081 9366
2018-09-30 64629 46379 15963 39842 8824
2018-10-31 68093 48609 16806 41643 9644
2018-11-30 74643 53052 18581 45073 10989
2018-12-31 60270 43042 15030 36649 8591
2019-01-31 76866 55463 18994 47789 10083
2019-02-28 74705 53744 18170 46674 9861
2019-03-31 78664 56562 19108 49144 10412
2019-04-30 77760 56175 19356 48224 10180
2019-05-31 88033 63219 22049 53859 12125
2019-06-30 70370 50626 17448 43454 9468
2019-07-31 76014 54531 18698 46947 10369
2019-08-31 83509 60418 21600 50653 11256
2019-09-30 77289 55375 19097 47517 10675
2019-10-31 83514 60021 20761 51397 11356
2019-11-30 81383 58460 20550 49551 11282
2019-12-31 68307 49172 17092 41990 9225
2020-01-31 59448 42384 14547 36472 8429
2020-02-29 53862 38544 13687 32457 7718
2020-03-31 62950 43478 14930 37403 10617
2020-04-30 7796 5645 1968 4811 1017
What am I missing? Many thanks in advance!

I think this is a problem with US vs. ISO (European) date format, i.e. YYYY-DD-MM vs. YYYY-MM-DD: it looks like pandas reads 2018-01-04 as the 4th of January and puts it into the 2018-01-31 block (i.e. January 2018).
You want to set the option dayfirst=True in your pd.to_datetime call; see the pandas docs for more details.
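A minimal sketch of that fix, assuming the dates in the CSV really are YYYY-DD-MM (file and column names as in the question); if the layout is known for certain, an explicit format string removes the ambiguity entirely:
import pandas as pd

data = pd.read_csv('Trucks.csv')
# treat the ambiguous middle field as the day rather than the month
data['Date'] = pd.to_datetime(data['Date'], dayfirst=True)
# or, if every row is known to be year-day-month, parse it explicitly:
# data['Date'] = pd.to_datetime(data['Date'], format='%Y-%d-%m')
data.set_index('Date', inplace=True)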

Related

Excel [h]:mm duration to pandas timedelta

I am importing data from an Excel worksheet where I have a 'Duration' field displayed in [h]:mm (so that the total number of hours is shown). I understand that underneath, this is simply a number of days as a float.
I want to work with this as a timedelta column or similar in a Pandas dataframe, but no matter what I do it drops any hours over 24 (i.e. the days portion).
Excel data (over 24 hours highlighted):
Pandas import (1d 7h 51m):
BATCH_NO Duration
354 7154 04:36:00
465 7270 06:35:00
466 7271 08:05:00
467 7272 05:54:00
468 7273 09:10:00
472 7277 06:15:00
476 7280 10:23:00
477 7284 06:09:00
499 7313 06:46:00
503 7322 05:27:00
510 7333 14:15:00
515 7335 1900-01-01 07:51:00
516 7338 07:51:00
517 7339 09:00:00
518 7339 05:29:00
519 7339 09:00:00
520 7339 05:29:00
522 7342 12:10:00
525 7343 08:00:00
530 7346 08:25:00
Running a to_datetime conversion simply drops the day (integer) part of the column:
BATCH_NO Duration
354 7154 04:36:00
465 7270 06:35:00
466 7271 08:05:00
467 7272 05:54:00
468 7273 09:10:00
472 7277 06:15:00
476 7280 10:23:00
477 7284 06:09:00
499 7313 06:46:00
503 7322 05:27:00
510 7333 14:15:00
515 7335 07:51:00
516 7338 07:51:00
517 7339 09:00:00
518 7339 05:29:00
519 7339 09:00:00
520 7339 05:29:00
522 7342 12:10:00
525 7343 08:00:00
530 7346 08:25:00
I have tried fixing the dtype on import, but only str or object work: dtype={'Duration': str} succeeds, while float gives the error float() argument must be a string or a number, not 'datetime.time'.
Even with str or object, Python still treats the column as datetime.time values.
Ideally I do not want to change the Excel source data or export to .csv as an intermediate step.
If I got it correctly, the imported objects are datetime and time, with durations over 24 hours coming in as datetime objects anchored to Excel's 1900 epoch.
So you must convert with a custom function:
from datetime import datetime, time, timedelta

def convert(t):
    # plain times (durations under 24h): anchor them to datetime.min
    if isinstance(t, time):
        t = datetime.combine(datetime.min, t)
    delta = t - datetime.min
    if delta.days != 0:
        # over-24h values came in as datetimes in Excel's 1900 date system;
        # 693594 days is the gap between datetime.min and Excel's day zero
        delta -= timedelta(days=693594)
    return delta

df['Duration'].apply(convert)
Output:
0 0 days 04:36:00
1 0 days 06:35:00
2 0 days 08:05:00
3 0 days 05:54:00
4 0 days 09:10:00
5 0 days 06:15:00
6 0 days 10:23:00
7 0 days 06:09:00
8 0 days 06:46:00
9 0 days 05:27:00
10 0 days 14:15:00
11 1 days 07:51:00 # corrected
12 0 days 07:51:00
13 0 days 09:00:00
14 0 days 05:29:00
15 0 days 09:00:00
...
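A hedged alternative sketch (not from the answer above) that builds the Timedelta directly from its parts; it assumes the same mix of datetime.time and datetime.datetime objects in the 'Duration' column:
import pandas as pd
from datetime import date, datetime

def to_td(t):
    # durations of 24h or more arrive as datetimes in Excel's 1900 date
    # system, where serial day 1 is 1900-01-01
    if isinstance(t, datetime):
        days = (t.date() - date(1899, 12, 31)).days
        t = t.time()
    else:
        days = 0
    return pd.Timedelta(days=days, hours=t.hour,
                        minutes=t.minute, seconds=t.second)

df['Duration'] = df['Duration'].apply(to_td)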

Find nlargest(2) with corresponding value after groupby

I have a Dataframe as below:
Datetime Volume Price
2020-08-05 09:15:00 1033 504
2020-08-05 09:15:00 1960 516
2020-08-05 09:15:00 1724 520
2020-08-05 09:15:00 1870 540
2020-08-05 09:20:00 1024 576
2020-08-05 09:20:00 1960 548
2020-08-05 09:20:00 1426 526
2020-08-05 09:20:00 1968 518
2020-08-05 09:30:00 1458 511
2020-08-05 09:30:00 1333 534
2020-08-05 09:30:00 1322 555
2020-08-05 09:30:00 1425 567
2020-08-05 09:30:00 1245 598
I want to find the top two Volume values, with their corresponding Price, after grouping by the Datetime column.
The result DataFrame should be as below:
Datetime Volume Price
2020-08-05 09:15:00 1960 516
2020-08-05 09:15:00 1870 540
2020-08-05 09:20:00 1960 548
2020-08-05 09:20:00 1968 518
2020-08-05 09:30:00 1458 511
2020-08-05 09:30:00 1425 567
Use sort_values before groupby:
print(df.sort_values("Volume", ascending=False)
        .groupby("Datetime").head(2)
        .sort_index())
Datetime Volume Price
1 2020-08-05 09:15:00 1960 516
3 2020-08-05 09:15:00 1870 540
5 2020-08-05 09:20:00 1960 548
7 2020-08-05 09:20:00 1968 518
8 2020-08-05 09:30:00 1458 511
11 2020-08-05 09:30:00 1425 567
Using groupby.rank + boolean indexing:
df[df.groupby("Datetime")['Volume'].rank(ascending=False).le(2)]
Datetime Volume Price
1 2020-08-05 09:15:00 1960 516
3 2020-08-05 09:15:00 1870 540
5 2020-08-05 09:20:00 1960 548
7 2020-08-05 09:20:00 1968 518
8 2020-08-05 09:30:00 1458 511
11 2020-08-05 09:30:00 1425 567
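One hedged caveat, not from the original answer: plain rank() gives tied Volumes the same rank, so a tie at the second-largest value can return more than two rows per group; method='first' breaks ties by row position:
df[df.groupby("Datetime")['Volume'].rank(method='first', ascending=False).le(2)]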
Since you mentioned nlargest:
out = df.groupby('Datetime',as_index=False).apply(lambda x : x.nlargest(2, columns=['Volume']))
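This version returns a MultiIndex of (group number, original row label); a small hedged cleanup to restore the original index:
out = out.reset_index(level=0, drop=True).sort_index()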

How to calculate moving average using pandas for a daily frequency over 3 years

I have a large dataset and need to calculate rolling returns over 3 years for each date. I am new to pandas and cannot work out how to do this. Below is my sample data frame.
nav_date price
1989 2019-11-29 25.02
2338 2019-11-28 25.22
1991 2019-11-27 25.11
1988 2019-11-26 24.98
1990 2019-11-25 25.06
1978 2019-11-22 24.73
1984 2019-11-21 24.84
1985 2019-11-20 24.90
1980 2019-11-19 24.78
1971 2019-11-18 24.67
1975 2019-11-15 24.69
1970 2019-11-14 24.64
1962 2019-11-13 24.58
1977 2019-11-11 24.73
1976 2019-11-08 24.72
1987 2019-11-07 24.93
1983 2019-11-06 24.84
1979 2019-11-05 24.74
1981 2019-11-04 24.79
1974 2019-11-01 24.68
2337 2019-10-31 24.66
1966 2019-10-30 24.59
1957 2019-10-29 24.47
1924 2019-10-25 24.06
2336 2019-10-24 24.06
1929 2019-10-23 24.10
1923 2019-10-22 24.05
1940 2019-10-18 24.20
1921 2019-10-17 24.05
1890 2019-10-16 23.77
1882 2019-10-15 23.70
1868 2019-10-14 23.52
1860 2019-10-11 23.45
1846 2019-10-10 23.30
1862 2019-10-09 23.46
2335 2019-10-07 23.08
1837 2019-10-04 23.18
1863 2019-10-03 23.47
1873 2019-10-01 23.57
1894 2019-09-30 23.80
1901 2019-09-27 23.88
1916 2019-09-26 24.00
1885 2019-09-25 23.73
1919 2019-09-24 24.04
1925 2019-09-23 24.06
1856 2019-09-20 23.39
1724 2019-09-19 22.22
1773 2019-09-18 22.50
1763 2019-09-17 22.45
1811 2019-09-16 22.83
1825 2019-09-13 22.98
1806 2019-09-12 22.79
1817 2019-09-11 22.90
1812 2019-09-09 22.84
1797 2019-09-06 22.72
1777 2019-09-05 22.52
1776 2019-09-04 22.51
2334 2019-09-03 22.42
1815 2019-08-30 22.88
1798 2019-08-29 22.73
1820 2019-08-28 22.93
1830 2019-08-27 23.05
1822 2019-08-26 22.95
1770 2019-08-23 22.48
1737 2019-08-22 22.30
1794 2019-08-21 22.66
2333 2019-08-20 22.86
1821 2019-08-19 22.93
1819 2019-08-16 22.92
1814 2019-08-14 22.88
I can do this in plain Python, but it takes too long to execute. In Python I do it like this:
from datetime import datetime
from dateutil.relativedelta import relativedelta

start_date = datetime(2019, 10, 31)
end_date = datetime(2016, 10, 31)  # 3 years back
years = 3
# price_dict maps each date to its price (built elsewhere from the frame).
# Look at each date between end_date and start_date, compute the 3-year
# CAGR for it, then average over all the dates.
total_returns = 0
for n in range(int((start_date - end_date).days)):
    sd = start_date - relativedelta(days=n)
    ed = sd - relativedelta(years=years)
    returns = (((price_dict[sd] / price_dict[ed]) ** (1 / years)) - 1) * 100
    total_returns += returns
roll_return = total_returns / int((start_date - end_date).days)
I am sure there is a way to get the same output using pandas without so much iteration, since my version is very slow and takes too much time to execute. Thanks in advance.
You didn't show the expected result... In any case, this is just an example, and I think you'll understand my approach.
df = pd.DataFrame({
'nav_date': (
'2019-11-29',
'2018-11-29',
'2017-11-29',
'2016-11-29',
'2019-11-28',
'2018-11-28',
'2017-11-28',
'2016-11-28',
),
'price': (
25.02, # <- example of your price(2019-11-29)
25.11,
25.06,
26.50, # <- example of your price(2016-11-29)
30.51,
30.41,
30.31,
30.21,
),
})
# parse year from the date string
df['year'] = df['nav_date'].apply(lambda x: x[0:4])
# parse date without year
df['nav_date'] = df['nav_date'].apply(lambda x: x[5:])
# years to columns, prices to rows
df = df.pivot(index='nav_date', columns='year', values='price')
df = pd.DataFrame(df.to_records())
# value calculation by columns: 3-year CAGR, in percent
df['2019'] = ((df['2019'] / df['2016']) ** (1 / 3) - 1) * 100
# df['2018'] = blablabla...
print(df)
Result:
nav_date 2016 2017 2018 2019
0 11-28 30.21 30.31 30.41 0.329927
1 11-29 26.50 25.06 25.11 -1.897409 # <- your expected value
So you'll have a dataframe with the calculated values for each day, and you can easily do anything with it (avg()/max()/min(), or any other manipulation).
Hope this helps.
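For completeness, a hedged vectorized sketch (not from the answer above) that avoids the day-by-day loop entirely: shift a copy of the series forward three years and use merge_asof, which falls back to the nearest earlier trading day, then compute every CAGR at once. It assumes df has the 'nav_date' and 'price' columns from the question:
import pandas as pd

df['nav_date'] = pd.to_datetime(df['nav_date'])
df = df.sort_values('nav_date')  # merge_asof needs sorted keys

# the price from (roughly) three years earlier, keyed by today's date
lagged = df.rename(columns={'price': 'price_3y_ago'})
lagged['nav_date'] = lagged['nav_date'] + pd.DateOffset(years=3)

out = pd.merge_asof(df, lagged, on='nav_date', direction='backward')
out['cagr_pct'] = ((out['price'] / out['price_3y_ago']) ** (1 / 3) - 1) * 100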

How to get values for the next month for a selected column from a pandas data frame with date time index

I have the below data frame (datetime index, with all working days in the US calendar):
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay
import random
us_bd = CustomBusinessDay(calendar=USFederalHolidayCalendar())
dt_rng = pd.date_range(start='1/1/2018', end='12/31/2018', freq=us_bd)
n1 = [round(random.uniform(20, 35),2) for _ in range(len(dt_rng))]
n2 = [random.randint(100, 200) for _ in range(len(dt_rng))]
df = pd.DataFrame(list(zip(n1,n2)), index=dt_rng, columns=['n1','n2'])
print(df)
n1 n2
2018-01-02 24.78 197
2018-01-03 23.33 176
2018-01-04 33.19 128
2018-01-05 32.49 110
... ... ...
2018-12-26 31.34 173
2018-12-27 29.72 166
2018-12-28 31.07 104
2018-12-31 33.52 184
[251 rows x 2 columns]
For each row in column n1, how can I get the value from the same column for the same day of the next month? (If the value for that exact day is not available due to weekends or holidays, it should take the value at the next available date.) I tried df.n1.shift(21), but it doesn't work because the number of working days differs from month to month.
Expected output as below
n1 n2 next_mnth_val
2018-01-02 25.97 184 28.14
2018-01-03 24.94 133 27.65 # these three values are the same, because in Feb 2018 the next working day after the 2nd is the 5th
2018-01-04 23.99 143 27.65
2018-01-05 24.69 182 27.65
2018-01-08 28.43 186 28.45
2018-01-09 31.47 104 23.14
... ... ... ...
2018-12-26 29.06 194 20.45
2018-12-27 29.63 158 20.45
2018-12-28 30.60 148 20.45
2018-12-31 20.45 121 20.45
For December, the next-month value should be the last value of the data frame, i.e. the value at index 2018-12-31 (20.45).
Please help.
This is an interesting problem. I would shift the date by 1 month, then shift it again to the next business day:
df1 = df.copy().reset_index()
# shift each date one month ahead, then step to the following business day
df1['new_date'] = df1['index'] + pd.DateOffset(months=1) + pd.offsets.BDay()
# match the shifted dates back against the index: the _x columns then hold
# the next-month values for the original date stored in 'index'
df.merge(df1, left_index=True, right_on='new_date')
Output (first 31 rows):
n1_x n2_x index n1_y n2_y new_date
0 34.82 180 2018-01-02 29.83 129 2018-02-05
1 34.82 180 2018-01-03 24.28 166 2018-02-05
2 34.82 180 2018-01-04 27.88 110 2018-02-05
3 24.89 186 2018-01-05 25.34 111 2018-02-06
4 31.66 137 2018-01-08 26.28 138 2018-02-09
5 25.30 162 2018-01-09 32.71 139 2018-02-12
6 25.30 162 2018-01-10 34.39 159 2018-02-12
7 25.30 162 2018-01-11 20.89 132 2018-02-12
8 23.44 196 2018-01-12 29.27 167 2018-02-13
12 25.40 153 2018-01-19 28.52 185 2018-02-20
13 31.38 126 2018-01-22 23.49 141 2018-02-23
14 30.90 133 2018-01-23 25.56 145 2018-02-26
15 30.90 133 2018-01-24 23.06 155 2018-02-26
16 30.90 133 2018-01-25 24.95 174 2018-02-26
17 29.39 138 2018-01-26 21.28 157 2018-02-27
18 32.94 173 2018-01-29 20.26 189 2018-03-01
19 32.94 173 2018-01-30 22.41 196 2018-03-01
20 32.94 173 2018-01-31 27.32 149 2018-03-01
21 28.09 119 2018-02-01 31.39 192 2018-03-02
22 32.21 199 2018-02-02 28.22 151 2018-03-05
23 21.78 120 2018-02-05 34.82 180 2018-03-06
24 28.25 127 2018-02-06 24.89 186 2018-03-07
25 22.06 189 2018-02-07 32.85 125 2018-03-08
26 33.78 121 2018-02-08 30.12 102 2018-03-09
27 30.79 137 2018-02-09 31.66 137 2018-03-12
28 29.88 131 2018-02-12 25.30 162 2018-03-13
29 20.02 143 2018-02-13 23.44 196 2018-03-14
30 20.28 188 2018-02-14 20.04 102 2018-03-15
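A hedged alternative sketch (assumptions: df as built in the question, and the rule that dates shifted past the end of the frame fall back to the last value): reindex over a calendar containing the shifted dates and backfill to the next available business day:
import pandas as pd

target = df.index + pd.DateOffset(months=1)  # same day, next month
idx = df.index.union(target)                 # both real and shifted dates
# bfill -> next available business day; ffill covers dates past the end
lookup = df['n1'].reindex(idx).bfill().ffill()
df['next_mnth_val'] = lookup.loc[target].to_numpy()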

Pandas Slicing Between Dates Then Replace Values With Zero

I have the following DataFrame:
Channel Column 1 Column 2 Column 3
Date
12/30/2018 638 4472 487
12/31/2018 868 6985 540
1/1/2019 755 4401 829
1/2/2019 1655 9484 1145
1/3/2019 2002 14212 1158
1/4/2019 1633 9575 1098
1/5/2019 1026 5575 941
1/6/2019 1025 4963 1007
1/7/2019 1944 10685 1246
1/8/2019 2140 9932 1151
1/9/2019 2067 1031 1087
1/10/2019 2168 1005 1074
1/11/2019 2052 9371 909
1/12/2019 1223 5953 895
1/13/2019 1268 4809 827
I would like to return the following result if possible (essentially, reduce the values between certain dates in a specific column to zero):
Channel Column 1 Column 2 Column 3
Date
12/30/2018 638 4472 487
12/31/2018 868 6985 540
1/1/2019 755 4401 829
1/2/2019 1655 9484 1145
1/3/2019 2002 14212 1158
1/4/2019 1633 9575 1098
1/5/2019 1026 5575 941
1/6/2019 0 4963 1007
1/7/2019 0 10685 1246
1/8/2019 0 9932 1151
1/9/2019 0 1031 1087
1/10/2019 2168 1005 1074
1/11/2019 2052 9371 909
1/12/2019 1223 5953 895
1/13/2019 1268 4809 827
I am trying to filter by a specific column at specific dates, but I can't get it to work properly.
I have tried the following approaches, but I haven't had much luck
df[df['Channel'] == 'Branded Paid Search'].loc['1/6/2019':'1/9/2019']['Sessions'].apply(lambda x: 0 if x < 4000 else 0).to_frame()
This works, but I'm not sure how to get the values back into the original dataframe.
I tried this:
def zero(df):
    if df[df['Column 1'] > 0].loc['1/6/2019':'1/9/2019']:
        return 0
    else:
        return 1

df.apply(zero, axis=1)
ValueError: ('The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().')
I tried this:
sessions_df[sessions_df['Column 1'] > 0].loc['1/6/2019':'1/9/2019'] = 0
Nothing changes.
Any help would be appreciated.
First create a DatetimeIndex with to_datetime, then set the values with DataFrame.loc:
df.index = pd.to_datetime(df.index)
df.loc['1/6/2019':'1/9/2019', 'Column 1'] = 0
print (df)
Column 1 Column 2 Column 3
Channel
2018-12-30 638 4472 487
2018-12-31 868 6985 540
2019-01-01 755 4401 829
2019-01-02 1655 9484 1145
2019-01-03 2002 14212 1158
2019-01-04 1633 9575 1098
2019-01-05 1026 5575 941
2019-01-06 0 4963 1007
2019-01-07 0 10685 1246
2019-01-08 0 9932 1151
2019-01-09 0 1031 1087
2019-01-10 2168 1005 1074
2019-01-11 2052 9371 909
2019-01-12 1223 5953 895
2019-01-13 1268 4809 827
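As a hedged side note on why the asker's boolean-indexing attempt changed nothing: chained indexing like sessions_df[mask].loc[...] = 0 assigns into a temporary copy. Putting every condition into a single .loc call writes to the original frame:
mask = (df.index >= '2019-01-06') & (df.index <= '2019-01-09') & (df['Column 1'] > 0)
df.loc[mask, 'Column 1'] = 0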
