I have a dataframe with a '%Y/%U' date column:
Value Count YW Date
0 2 2017/19 2017-05-13
1 2 2017/20 2017-05-19
2 24 2017/22 2017-06-03
3 35 2017/23 2017-06-10
4 41 2017/24 2017-06-17
.. ... ... ...
126 51 2020/05 2020-02-06
127 26 2020/06 2020-02-15
128 30 2020/07 2020-02-22
129 26 2020/08 2020-02-29
130 18 2020/09 2020-03-04
I'm trying to add the missing weeks, like 2017/21 with 0 Count values, so I created this index:
idx = pdh.pd.date_range(df['Date'].min(), df['Date'].max(), freq='W').floor('d')
Which yields:
DatetimeIndex(['2017-05-14', '2017-05-21', '2017-05-28', '2017-06-04',
'2017-06-11', '2017-06-18', '2017-06-25', '2017-07-02',
'2017-07-09', '2017-07-16',
...
'2019-12-29', '2020-01-05', '2020-01-12', '2020-01-19',
'2020-01-26', '2020-02-02', '2020-02-09', '2020-02-16',
'2020-02-23', '2020-03-01'],
dtype='datetime64[ns]', length=147, freq=None)
Almost there, converting to '%Y/%U' again:
idx = idx.strftime('%Y/%U')
But this yields:
Index(['2017/20', '2017/21', '2017/22', '2017/23', '2017/24', '2017/25',
'2017/26', '2017/27', '2017/28', '2017/29',
...
'2019/52', '2020/01', '2020/02', '2020/03', '2020/04', '2020/05',
'2020/06', '2020/07', '2020/08', '2020/09'],
dtype='object', length=147)
I'm not sure yet whether it is a problem with reindexing, but I've noticed that the first year/week pair is now 2017/20 instead of 2017/19. This is because the freq='W' offset converts every date to the corresponding week's starting day, the default being the same as the 'W-SUN' anchored offset. Indeed, 2017-05-14 is a Sunday.
The problem is that the converted date now yields the next week number: 2017-05-13 was converted to 2017-05-14. The %U strftime code does start weeks on Sunday as well, but it counts from the previous Sunday. Using 'W-SAT' (as 2017-05-13 was a Saturday) fixes the start, but then the end will be wrong in this case.
Is there any dynamic solution so date_range would start and end with the proper weeks?
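One possible dynamic approach (a sketch, not from the original post): since %U weeks start on Sunday, roll both endpoints back to the Sunday of their own week before building the range. rollback() leaves dates that already fall on a Sunday untouched, so this adapts to whatever the min and max dates happen to be:
import pandas as pd

sunday = pd.offsets.Week(weekday=6)         # weekday 6 = Sunday
start = sunday.rollback(df['Date'].min())   # 2017-05-13 (Sat) -> 2017-05-07
end = sunday.rollback(df['Date'].max())     # 2020-03-04 (Wed) -> 2020-03-01
idx = pd.date_range(start, end, freq='W-SUN').strftime('%Y/%U')
# Caveat: dates before a year's first Sunday belong to %U week 00, and rolling
# back across the year boundary would label them with the previous year.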
Related
I have a dataframe containing hourly data. I want to get the max for each week of the year, so I used resample to group the data by week:
weeks = data.resample("W").max()
The problem is that the weekly max is calculated starting from the first Monday of the year, while I want it calculated starting from the first day of the year.
I obtain the following result, where you can see that there are 53 weeks and that the last week is calculated into the next year, even though 2017 doesn't exist in the data:
Date dots
2016-01-03 0.647786
2016-01-10 0.917071
2016-01-17 0.667857
2016-01-24 0.669286
2016-01-31 0.645357
...
Date dots
2016-12-04 0.646786
2016-12-11 0.857714
2016-12-18 0.670000
2016-12-25 0.674571
2017-01-01 0.654571
Is there a way to calculate weeks for a pandas dataframe starting from the first day of the year?
Find the starting day of the year; for example, let's say it's Friday. You can then specify an anchoring suffix to resample in order to calculate weeks starting from the first day of the year:
weeks = data.resample("W-FRI").max()
One quick remedy, given your data lies within a single year, is to group it by day first, then take groups of 7 days. This works because after the daily resample the RangeIndex counts consecutive days from the first date, so integer division by 7 forms weeks anchored on January 1st:
new_df = (df.resample("D", on='Date').dots
            .max().reset_index()
         )
weekly = new_df.groupby(new_df.index // 7).agg({'Date': 'min', 'dots': 'max'})
weekly.head(7)
Output:
Date dots
0 2016-01-01 0.996387
1 2016-01-08 0.999775
2 2016-01-15 0.997612
3 2016-01-22 0.979376
4 2016-01-29 0.998240
5 2016-02-05 0.995030
6 2016-02-12 0.987500
And the tail:
Date dots
48 2016-12-02 0.999910
49 2016-12-09 0.992910
50 2016-12-16 0.996877
51 2016-12-23 0.992986
52 2016-12-30 0.960348
I have a df with values, e.g.:
jpy3m jpy1w
timestamp
2019-01-09 00:00:00 -45 -25
2019-01-08 00:00:00 -48 -31
2019-01-07 00:00:00 -51 -27
2019-01-04 00:00:00 -46 -25
...
2016-01-06 00:00:00 -61 -26
2016-01-05 00:00:00 -62 -22
2016-01-04 00:00:00 -57 -21
The index runs from today back to the start of 2016. Business days only.
What I wish to compute, but am unable to, is: for each day where it is possible to do so, take the value of jpy3m and subtract it from the average of jpy1w taken on the same day and at each of the next 11 weekly offsets.
e.g. consider 2016-01-04:
take the value of jpy3m = -57
take the average of jpy1w on 2016-01-04, 2016-01-11 (1w later), 2016-01-19 (2w later; the 18th is not a good business day), 2016-01-25 (3w later), and so on until 2016-03-25 (11w later)
let's say this averages -25
then take -25 - (-57) = +32, so +32 is the value for Jan 4th
This will go in a new column, df['result']
And repeat for Jan 5th 2016, and so forth.
I understand the most recent 3 months won't have a result and will be np.nan.
Is this possible?
Thank you.
I am assuming that if a day is not a good business day then that record is not in your dataset; if such records are present, remove them first.
Sort the records in descending order of time.
With business days only, one calendar week is 5 rows, so we average the current value with the 5th, 10th, ..., 55th following values:
avgs = df['jpy1w'].copy()
for i in range(11):
    # With the descending sort, shift(+k) pulls the value from k rows above,
    # i.e. from later dates.
    avgs = df['jpy1w'].shift(5 * (i + 1)) + avgs
avgs = avgs / 12  # the day itself plus 11 weekly values
df['result'] = avgs - df['jpy3m']  # e.g. -25 - (-57) = +32 for 2016-01-04
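Note that row shifts assume a perfectly regular five-rows-per-week grid, which holidays break (the 2016-01-18 example in the question). A date-based sketch that falls forward to the next available business day instead (assuming the index is a DatetimeIndex, names as above):
import numpy as np
import pandas as pd

s = df['jpy1w'].sort_index()  # ascending order is required for reindex/bfill
# For each weekly offset k, look up jpy1w at date + k weeks; when that exact
# date is missing (a holiday), take the next available business day instead.
vals = [s.reindex(s.index + pd.Timedelta(weeks=k), method='bfill').to_numpy()
        for k in range(12)]   # the day itself plus the next 11 weeks
# np.mean leaves NaN wherever a weekly value runs past the end of the data
# (the most recent ~3 months), matching the expectation in the question.
avg = pd.Series(np.mean(vals, axis=0), index=s.index)
df['result'] = avg - df['jpy3m']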
I was answering another question here about something in pandas I thought I knew, time series resampling, when I noticed this odd binning.
Let's say I have a dataframe with a daily date range index and a column I want to resample and sum on.
import numpy as np
import pandas as pd

index = pd.date_range(start="1/1/2018", end="31/12/2018")
df = pd.DataFrame(np.random.randint(100, size=len(index)),
                  columns=["sales"], index=index)
>>> df.head()
sales
2018-01-01 66
2018-01-02 18
2018-01-03 45
2018-01-04 92
2018-01-05 76
Now, if I resample by one month, everything looks fine:
>>> df.resample("1M").sum()
sales
2018-01-31 1507
2018-02-28 1186
2018-03-31 1382
[...]
2018-11-30 1342
2018-12-31 1337
If I try to resample over multiple months, though, the binning starts to look off. This is particularly evident with 6M:
df.resample("6M").sum()
sales
2018-01-31 1507
2018-07-31 8393
2019-01-31 7283
The first bin spans just over one month, and the last bin reaches one month into the future. Maybe I have to set closed="left" to get the proper limits:
df.resample("6M", closed="left").sum()
sales
2018-06-30 8090
2018-12-31 9054
2019-06-30 39
Now I have an extra bin in 2019 with the data from 2018-12-31...
Is this working properly? Am I missing an option I should set?
EDIT: here's the output I would expect when resampling one year in six-month intervals, the first interval spanning Jan 1st to Jun 30th and the second spanning Jul 1st to Dec 31st:
df.resample("6M", closed="left").sum()
sales
2018-06-30 8090
2018-12-31 9093 # 9054 + 39
Note that there's also some doubt about what happens with the June 30 data: does it go in the first bin, as I would expect, or in the second? With the last bin it's evident, but the same thing is probably happening at every bin edge.
The M time offset alias implies month end frequency.
What you need is 6MS which is an alias for month start frequency:
df.resample('6MS').sum()
resulting in
sales
2018-01-01 8130
2018-07-01 9563
2019-01-01 0
Also df.groupby(pd.Grouper(freq='6MS')).sum() can be used interchangeably.
For extra clarity you can compare ranges directly:
>>> pd.date_range('2018-01-01', '2018-12-31', freq='6M')
DatetimeIndex(['2018-01-31', '2018-07-31'], dtype='datetime64[ns]', freq='6M')
>>> pd.date_range('2018-01-01', '2018-12-31', freq='6MS')
DatetimeIndex(['2018-01-01', '2018-07-01'], dtype='datetime64[ns]', freq='6MS')
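And if you want the right-edge labels from the question's expected output (2018-06-30, 2018-12-31), one sketch is to resample by month start and then shift the labels to the corresponding month ends:
out = df.resample('6MS').sum()
out.index = out.index + pd.offsets.MonthEnd(6)  # 2018-01-01 -> 2018-06-30, etc.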
Adding np.random.seed(365) so we can check both our outputs:
print(df.resample("6M", kind='period').sum())
sales
2018-01 8794
2018-07 9033
Would this work for you?
I am looking to determine the count of string values in a column across a 3-month data sample. Samples were taken at random times throughout each day. I can group the data by hour, but I require the fidelity of 30-minute intervals (e.g. 0500-0530, 0530-0600) on roughly 10k rows of data.
An example of the data:
datetime stringvalues
2018-06-06 17:00 A
2018-06-07 17:30 B
2018-06-07 17:33 A
2018-06-08 19:00 B
2018-06-09 05:27 A
I have tried setting the datetime column as the index, but I cannot figure out how to group the data on anything other than 'hour', and I don't get fidelity on the string value count:
df['datetime'] = pd.to_datetime(df['datetime'])
df.index = df['datetime']
df.groupby(df.index.hour).count()
Which returns an output similar to:
datetime stringvalues
datetime
5 0 0
6 2 2
7 5 5
8 1 1
...
I have researched multi-indexing and resampling at some length over the past two days, but I was unable to find a similar question. The desired result would look something like this:
datetime A B
0500 1 2
0530 3 5
0600 4 6
0630 2 0
....
There is no straightforward way to do a TimeGrouper on the time component, so we do this in two steps:
v = (df.groupby([pd.Grouper(key='datetime', freq='30min'), 'stringvalues'])
.size()
.unstack(fill_value=0))
v.groupby(v.index.time).sum()
stringvalues A B
05:00:00 1 0
17:00:00 1 0
17:30:00 1 1
19:00:00 0 1
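For what it's worth, a single-pass sketch of the same idea (using the question's column names): floor each timestamp to its 30-minute slot, keep only the time of day, and cross-tabulate against the string values:
slots = pd.to_datetime(df['datetime']).dt.floor('30min').dt.time
pd.crosstab(slots, df['stringvalues'])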
Is it somehow possible to use resample on irregularly spaced data? (I know that the documentation says it's for "resampling of regular time-series data", but I wanted to try if it works on irregular data, too. Maybe it doesn't, or maybe I am doing something wrong.)
In my real data, I have generally 2 samples per hour, the time difference between them ranging usually from 20 to 40 minutes. So I was hoping to resample them to a regular hourly series.
To test whether I am using it right, I used a random list of dates that I already had, so it may not be the best example, but at least a solution that works for it will be very robust. Here it is:
fraction number time
0 0.729797 0 2014-10-23 15:44:00
1 0.141084 1 2014-10-30 19:10:00
2 0.226900 2 2014-11-05 21:30:00
3 0.960937 3 2014-11-07 05:50:00
4 0.452835 4 2014-11-12 12:20:00
5 0.578495 5 2014-11-13 13:57:00
6 0.352142 6 2014-11-15 05:00:00
7 0.104814 7 2014-11-18 07:50:00
8 0.345633 8 2014-11-19 13:37:00
9 0.498004 9 2014-11-19 22:47:00
10 0.131665 10 2014-11-24 15:28:00
11 0.654018 11 2014-11-26 10:00:00
12 0.886092 12 2014-12-04 06:37:00
13 0.839767 13 2014-12-09 00:50:00
14 0.257997 14 2014-12-09 02:00:00
15 0.526350 15 2014-12-09 02:33:00
Now I want to resample these, for example monthly:
df_new = df.set_index(pd.DatetimeIndex(df['time']))
df_new['fraction'] = df.fraction.resample('M',how='mean')
df_new['number'] = df.number.resample('M',how='mean')
But I get TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex' - unless I did something wrong with assigning the datetime index, it must be due to the irregularity?
So my questions are:
Am I using it correctly?
If 1==True, is there no straightforward way to resample the data?
(I only see a solution in first reindexing the data to finer intervals, interpolating the values in between, and then reindexing back to an hourly interval. If that is so, then a question regarding the correct implementation of reindex will follow shortly.)
You don't need to use DatetimeIndex explicitly; just set 'time' as the index and pandas will take care of the rest, as long as your 'time' column has been converted to datetime with pd.to_datetime or some other method. (The TypeError comes from resampling the original df, which still has its RangeIndex, instead of df_new.) Additionally, you don't need to resample each column individually if you're using the same aggregation; just do it on the entire DataFrame.
# Convert to datetime, if necessary.
df['time'] = pd.to_datetime(df['time'])
# Set the index and resample (using month start freq for compact output).
df = df.set_index('time')
df = df.resample('MS').mean()
The resulting output:
fraction number
time
2014-10-01 0.435441 0.5
2014-11-01 0.430544 6.5
2014-12-01 0.627552 13.5
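As a side note, resample can also take the datetime column directly through its on keyword (available since pandas 0.19), so starting from the frame before set_index this sketch should be equivalent:
monthly = df.resample('MS', on='time').mean()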