I'm grouping my data on some frequency, but it appears that TimeGrouper creates a last group on the right for some "left over" data.
df.groupby([pd.TimeGrouper("2AS", label='left')]).sum()['shopping'].plot()
I expect the data to be fairly constant over time, but the last data point at 2013 drops by almost half. I expect this to happen because with biannual grouping, the second half (2014) is missing.
rolling_mean allows center=True, which will put NaN/drop remainders on the left and right. Is there a similar feature for the Grouper? I couldn't find any in the manual, but perhaps there is a workaround?
I don't think the issue here really concerns options available with TimeGrouper, but rather, how you want to deal with uneven data. You basically have 4 options that I can think of:
1) Drop enough observations (at the start or end) such that you have a multiple of 2 years worth of observations.
2) Extrapolate your starting (or ending) period such that it is comparable to the periods with complete data.
3) Normalize your data to 2 year sums based on underlying time periods of less than 2 years. This approach could be combined with the other two.
4) Instead of a groupby sort of approach, just do a rolling_sum.
Example dataframe:
import numpy as np
import pandas as pd

rng = pd.date_range('1/1/2010', periods=60, freq='M')   # 60 monthly observations (5 years)
df = pd.DataFrame({'shopping': np.random.choice(12, 60)}, index=rng)
I just made the example data set with 5 years of data starting on Jan 1, so if you did this on an annual basis, you'd be done.
df.groupby([pd.TimeGrouper("AS", label='left')]).sum()['shopping']
Out[206]:
2010-01-01 78
2011-01-01 60
2012-01-01 76
2013-01-01 51
2014-01-01 60
Freq: AS-JAN, Name: shopping, dtype: int64
Here's your problem in table form, with the first 2 groups based on 2 years of data but the third group based on only 1 year of data.
df.groupby([pd.TimeGrouper("2AS", label='left')]).sum()['shopping']
Out[205]:
2010-01-01 138
2012-01-01 127
2014-01-01 60
Freq: 2AS-JAN, Name: shopping, dtype: int64
If you take approach (1) above, you just need to drop some observations. It's very easy to drop the later observations and retype the same command. It's a little trickier to drop the earlier observations because then your first observation doesn't begin on Jan 1 of an even year and you lose the automatic labelling and such. Here's an approach that will drop the first year and keep the last 4, but you lose the nice labelling (you can compare with annual data above to verify that this is correct):
In [202]: df2 = df[12:]
In [203]: df2['group24'] = (np.arange( len(df2) ) / 24 ).astype(int)
In [204]: df2.groupby('group24').sum()['shopping']
Out[204]:
group24
0 136
1 111
Alternatively, let's try approach (2), extrapolating. To do this, just replace sum() with mean() and multiply by 24. For the last period, this just means we assume that the 60 in 2014 will be equaled by another 60 in 2015. Whether or not this is reasonable will be a judgement call for you to make and you'd probably want to label with an asterisk and call it an estimate.
df.groupby([pd.TimeGrouper("2AS")]).mean()['shopping']*24
Out[208]:
2010-01-01 138
2012-01-01 127
2014-01-01 120
Freq: 2AS-JAN, Name: shopping, dtype: float64
Also keep in mind this is just one simple (probably simplistic) way you could extrapolate at the end of the period. Whether this is the best way to do it (or whether it makes sense to extrapolate at all) is a judgement call for you to make depending on the situation.
Next, you could take approach (3) and do some sort of normalization. I'm not sure exactly what you want, so I'll just sketch the ideas. If you want to display two-year sums, you could just take the earlier annual example (replacing "2AS" with "AS") and multiply by 2. This basically makes the table look wrong, but it's a really simple way to make the graph look OK.
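For instance, a minimal sketch of that normalization, reusing the annual grouping from above (note that pd.TimeGrouper has since been removed from pandas; pd.Grouper(freq="AS") is the modern spelling):

df.groupby([pd.TimeGrouper("AS", label='left')]).sum()['shopping'] * 2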
Finally, approach (4): just use a rolling sum:
pd.rolling_sum(df.shopping, window=24)
(In more recent pandas this is spelled df.shopping.rolling(window=24).sum().) It doesn't display well as a table, but it plots nicely.
The title might be a bit confusing here, but in essence:
We have a dataframe, let’s say like this:
   SKUCODE PROCCODE   RUNCODE  RSNCODE           BEGTIME           ENDTIME                 RSNNAME
0   218032        A  21183528     1010  2020-11-19 04:00  2020-11-19 04:15           Bank holidays
1   218032        A  21183959  300.318  2021-01-27 10:50  2021-01-27 10:53  Base paper roll change
2   218032        A  21183959  300.302  2021-01-27 11:50  2021-01-27 11:53    Product/grade change
3   218032        A  21183959  300.302  2021-01-27 12:02  2021-01-27 13:12    Product/grade change
4   200402        A  21184021  300.302  2021-01-27 13:16  2021-01-27 13:19    Product/grade change
Each row is a break event happening on a production line. As can be seen, some singular break events (sharing a common RSNNAME) are spread out over multiple consecutive rows (for one data-gathering reason or another), and we would like to compress each of these into just one row, for example compressing rows 2 through 4 of our example dataframe into a single row, resulting in something like this:
   SKUCODE PROCCODE   RUNCODE  RSNCODE           BEGTIME           ENDTIME                 RSNNAME
0   218032        A  21183528     1010  2020-11-19 04:00  2020-11-19 04:15           Bank holidays
1   218032        A  21183959  300.318  2021-01-27 10:50  2021-01-27 10:53  Base paper roll change
2   218032        A  21183959  300.302  2021-01-27 11:50  2021-01-27 13:19    Product/grade change
The resulting single row would have the BEGTIME (signifying the start of the break) of the first row that was combined and the ENDTIME (signifying the end of the break) of the last row that was combined, making sure we capture the correct timestamps for the entire break event.
If we want to make the problem harder still, we might want to add a time threshold for combining rows. Say, if there is a gap of more than 15 minutes between two rows that seemingly belong to the same break event (i.e. between the ENDTIME of the former and the BEGTIME of the latter), we would treat them as separate events instead.
This is accomplished quite easily with iterrows by comparing each row to the next, checking whether they contain a duplicate value in the RSNNAME column, and copying the ENDTIME of the latter onto the former if that is the case. The latter row can then be dropped as useless. Here we might also introduce logic to detect whether a seemingly singular break event is actually two separate events of the same nature merely happening some time apart.
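For reference, a rough sketch of that iterrows approach (only a sketch; it assumes BEGTIME and ENDTIME are already parsed as datetimes and folds in the 15-minute threshold):

import pandas as pd

merged_rows = []
for _, row in df.iterrows():
    prev = merged_rows[-1] if merged_rows else None
    same_event = (
        prev is not None
        and row['RSNNAME'] == prev['RSNNAME']
        and (row['BEGTIME'] - prev['ENDTIME']) <= pd.Timedelta(minutes=15)
    )
    if same_event:
        prev['ENDTIME'] = row['ENDTIME']   # extend the previous break event
    else:
        merged_rows.append(row.copy())     # start a new event

compressed = pd.DataFrame(merged_rows).reset_index(drop=True)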
However, using iterrows for this purpose gets quite slow. Is there a way to solve this problem with vectorized functions or other more efficient means? I've played around with shifting the rows and comparing them to each other. Shifting and comparing two adjacent rows is quite simple and lets us easily grab the ENDTIME of the latter row when a duplicate is detected, but we run into issues in the case of n consecutive duplicate causes.
Another idea would be to create a boolean mask that checks whether the row below the current one is a duplicate. With multiple consecutive duplicate rows we then get multiple consecutive "True" labels, and the last "True" before a "False" marks the row whose ENDTIME we want to grab for the first "True" of that particular run of "Trues". I have yet to find a way to implement this with vectorization, however.
For the basic problem, drop_duplicates can be used. Essentially, you
1) drop the duplicates on the RSNNAME column, keeping the first occurrence, and
2) replace the ENDTIME column with the end times obtained by dropping duplicates again, this time keeping the last occurrence.
(
    df.drop_duplicates("RSNNAME", keep="first").assign(
        ENDTIME=df.drop_duplicates("RSNNAME", keep="last").ENDTIME.values
    )
)
(By using the .values we ignore the index in the assignment.)
To give you an idea for the more complex scenario: You are on the right track with your last idea. You want to .shift the column in question by one row and compare that to the original column. That gives you flags where new consecutive events start:
>>> df.RSNNAME != df.shift().RSNNAME
0 True
1 True
2 True
3 False
4 False
Name: RSNNAME, dtype: bool
To turn that into something .groupby-able, you compute the cumulative sum:
>>> (df.RSNNAME != df.shift().RSNNAME).cumsum()
0 1
1 2
2 3
3 3
4 3
Name: RSNNAME, dtype: int64
For your case, one option could be to extend df.RSNNAME != df.shift().RSNNAME with the time-difference condition to get the proper flags, but I suggest you play a bit with this shift/cumsum approach; a sketch is given below.
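As a hedged sketch of how that could look (assuming BEGTIME and ENDTIME are datetime columns and reusing the 15-minute threshold from the question; taking the first value of the remaining columns is one reasonable choice, not the only one):

import pandas as pd

# a new event starts when the reason changes OR the gap since the previous
# row's ENDTIME exceeds 15 minutes
gap = df['BEGTIME'] - df['ENDTIME'].shift()
new_event = (df['RSNNAME'] != df['RSNNAME'].shift()) | (gap > pd.Timedelta(minutes=15))

compressed = (
    df.groupby(new_event.cumsum())
      .agg({'SKUCODE': 'first', 'PROCCODE': 'first', 'RUNCODE': 'first',
            'RSNCODE': 'first', 'BEGTIME': 'first', 'ENDTIME': 'last',
            'RSNNAME': 'first'})
      .reset_index(drop=True)
)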
df1 looks like :-
new_df = pd.DataFrame(columns=df1.columns)
for name, df in df1.groupby((df1.RSNNAME != df1.RSNNAME.shift()).cumsum()):
    if df.shape[0] == 1:
        new_df = pd.concat([new_df, df])
    else:
        df.iloc[0, df.columns.get_loc('ENDTIME')] = df.iloc[-1]["ENDTIME"]
        new_df = pd.concat([new_df, df.head(1)])
new_df looks like :-
I start with a large list of all Bitcoin prices. I import it into a Dataframe.
df.head()
BTC-USDT_close
open_time
2021-11-05 22:28:00 61151.781250
2021-11-05 22:27:00 61199.011719
2021-11-05 22:26:00 61201.398438
2021-11-05 22:25:00 61237.828125
2021-11-05 22:24:00 61195.578125
...
221651 rows total.
What I need is the following. For each row in this dataframe:
- take the next 60 values
- take the next 60 in every 5 values
- take the next 60 in every 15 values
- take the next 60 in every 60 values
- take the next 60 in every 360 values
- take the next 60 in every 5760 values
- add this new table of 60 rows as an array to a list
So in the end I want to have a lot of these:
small_df.head(6)
BTC-USDT_1m BTC-USDT_5m BTC-USDT_15m BTC-USDT_1h BTC-USDT_6h BTC-USDT_4d
0 61199.011719 61199.011719 61199.011719 61199.011719 61199.011719 61199.011719
1 61201.398438 61241.390625 61159.578125 61079.800781 60922.968750 60968.320312
2 61237.828125 61309.000000 61063.628906 60845.710938 61682.960938 60717.500000
3 61195.578125 61159.578125 61100.000000 61060.000000 62191.000000 60939.210938
4 61221.179688 61165.371094 61079.800781 61220.011719 61282.000000 65934.328125
5 61241.390625 61047.488281 61175.238281 60812.210938 61190.300781 60599.000000
...
60 rows total
(Basically these are the sequences of 60 previous values in different time frames)
So the code is as follows:
seq_len = 60   # length of each sequence, per the description above

seq_list = []
for i in range(len(df) // 2):
    r = i + 1
    small_df = pd.DataFrame()
    small_df['BTC-USDT_1m'] = df['BTC-USDT_close'][r:r+seq_len:1].reset_index(drop=True)
    small_df['BTC-USDT_5m'] = df['BTC-USDT_close'][r:(r+seq_len)*5:5].dropna().reset_index(drop=True)
    small_df['BTC-USDT_15m'] = df['BTC-USDT_close'][r:(r+seq_len)*15:15].dropna().reset_index(drop=True)
    small_df['BTC-USDT_1h'] = df['BTC-USDT_close'][r:(r+seq_len)*60:60].dropna().reset_index(drop=True)
    small_df['BTC-USDT_6h'] = df['BTC-USDT_close'][r:(r+seq_len)*360:360].dropna().reset_index(drop=True)
    small_df['BTC-USDT_4d'] = df['BTC-USDT_close'][r:(r+seq_len)*5760:5760].dropna().reset_index(drop=True)
    seq_list.append([small_df, df['target'][r]])
As you can imagine, it's very slow, it can do about 1500 sequences per minute, so the whole process is going to take 12 hours.
Could you please show me a way to speed things up?
Thanks in advance!
You wouldn't do this by indexing, as this creates large indexes and is inefficient. Instead, you would use .rolling() to create rolling windows.
You can see in the documentation that rolling also supports rolling windows over timestamps. Here is the example copied from the docs:
>>> df_time.rolling('2s').sum()
B
2013-01-01 09:00:00 0.0
2013-01-01 09:00:02 1.0
2013-01-01 09:00:03 3.0
2013-01-01 09:00:05 NaN
2013-01-01 09:00:06 4.0
In your case, you could do the following
small_df['BTC-USDT_1m'] = df['BTC-USDT_close'].rolling("1m").mean().reset_index(drop=True)
The first argument is always the size of the window, i.e. the number of samples to take from df. This can be an integer for an exact number of samples, or a time offset string (like "2s" above) to march through the table based on a fixed timeframe.
In this case it would compute the mean price based on a moving window of 1 minute.
This would be far more accurate than your index-based solution, since that one does not take the distance between timestamps into account and only picks single values, making it highly dependent on local fluctuations. A mean over a given window size gives you the average price over that timespan.
However, if you want just the single price at each step, then you simply use a small window together with the step argument (available since pandas 1.5), like
small_df['BTC-USDT_1m'] = df['BTC-USDT_close'].rolling(window=1, step=60).mean().reset_index(drop=True)
With a window of 1, the mean() just picks the value itself; the step argument makes the moving window skip ahead 60 samples each time a value is taken, rather than considering every single element.
Any solution like yours, or the latter one with step, of course produces a different number of samples than the original, so you would have to think about whether you want to drop NaN values, fill in gaps, use an expanding window, and so on.
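If you do want the exact per-position samples described in the question rather than rolling means, a hedged alternative is to index the underlying NumPy array directly, which avoids building many intermediate Series per window. This is only a sketch, under the assumption that "take next 60 in every 5 values" means positions r, r+5, r+10, and so on:

import numpy as np
import pandas as pd

def make_window(df, r, seq_len=60):
    close = df['BTC-USDT_close'].to_numpy()
    steps = {'1m': 1, '5m': 5, '15m': 15, '1h': 60, '6h': 360, '4d': 5760}
    data = {}
    for name, step in steps.items():
        idx = r + step * np.arange(seq_len)   # positions r, r+step, r+2*step, ...
        idx = idx[idx < len(close)]           # stop at the end of the data
        data[f'BTC-USDT_{name}'] = pd.Series(close[idx])
    return pd.DataFrame(data)

Looping over r and appending make_window(df, r) together with the target mirrors the original code, but the per-window work happens in NumPy and should be considerably faster.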
I'm working with a time series dataframe which shows cumulative positions of a given entity for each hour of the day from 01/06/2022 - 22/08/2022. I'm looking to take the average of the last 7 days which have a specific label against it, which may not necessarily fall on the previous 7 days leading up to the 22/08/2022, see example below:
The labels against the entries can include:
5f
5i
5j
5x
5h
Each day will have one label against it, repeated for each hour of the day. To put it simply, I want to average the most recent 7 days in the 2-month dataframe which have one of these specific labels against them, i.e. the average cumulativeVol for each hour for the past 7 days where we have had a 5f strategy, or 5i, etc.
Expected output of the script should have a DF dimension of [24x1].
I'm wondering, can this be achieved solely through Pandas? Or would a tailored method need to be developed?
Any help greatly appreciated.
IIUC this will give the average cumulativeVols per hour and strategy over the last week of data:
last_week = df[df.StartDateTime >= (df.StartDateTime.iloc[-1] - pd.Timedelta(weeks=1))]
last_week.set_index('StartDateTime', inplace=True)
last_week.groupby([pd.Grouper(freq='H'), 'Strategy']).cumulativeVols.mean()
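If instead you need the last 7 days that carry one particular label (which may not be the last 7 calendar days), a hedged sketch could look like this; the column names StartDateTime, Strategy and cumulativeVols are taken from the snippet above, and '5f' is just an example label:

import pandas as pd

label = '5f'
labelled = df[df['Strategy'] == label].copy()
labelled['day'] = labelled['StartDateTime'].dt.normalize()

# the 7 most recent distinct days carrying this label
last7 = labelled['day'].drop_duplicates().sort_values().iloc[-7:]
recent = labelled[labelled['day'].isin(last7)]

# one mean per hour of day -> (up to) 24 rows
hourly_avg = recent.groupby(recent['StartDateTime'].dt.hour)['cumulativeVols'].mean()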
I have a timeseries dataset with 5M rows.
The column has 19.5% missing values and 80% zeroes (don't be put off by the percentages: although that leaves only 0.5% of the data as useful, 0.5% of 5M rows is still enough). Now, I need to impute this column.
Given the number of rows, it's taking around 2.5 hours for KNN to impute the whole thing.
To make it faster, I thought of deleting all the zero-value rows and then carrying out the imputation. But I feel that using KNN naively after this would lead to overestimation (since all the zero values are gone while the number of neighbours stays fixed, the mean is expected to increase).
So, is there a way to:
1) modify the data fed into the KNN model, or
2) carry out the imputation after removing the rows with zeros,
so that the values obtained after imputation are the same, or at least close?
To understand the problem more clearly, consider the following dummy dataframe:
DATE VALUE
0 2018-01-01 0.0
1 2018-01-02 8.0
2 2018-01-03 0.0
3 2018-01-04 0.0
4 2018-01-05 0.0
5 2018-01-06 10.0
6 2018-01-07 NaN
7 2018-01-08 9.0
8 2018-01-09 0.0
9 2018-01-10 0.0
Now, if I use KNN (k=3), then with zeros, the value would be the weighted mean of 0, 10 and 9. But if I remove the zeros naively, the value will be imputed with the weighted mean of 8, 10 and 9.
A few rough ideas which I thought of but could not follow through on were as follows:
1) Modifying the weights (used in the weighted-mean computation) of the KNN imputation process so that the removed 0s are taken into account during imputation.
2) Adding a column which says how many neighbouring zeros a particular row has, and then somehow using it to modify the imputation process.
Points 1 and 2 are just rough ideas that came to mind while thinking about the problem and might help when writing an answer.
PS -
Obviously, I am not feeding the time series data directly into KNN. What I am doing is extracting month, day, etc. from the date column, and then using this for imputation.
I do not need parallel processing as an answer to make the code run faster. The data is so large that high RAM usage hangs my laptop.
Let's think logically and leave the machine learning part aside for the moment.
Since we are dealing with a time series, it would be good to impute the data with the average of the values for the same date in different years, say 2-3 years (if we consider 2 years, that means 1 year before and 1 year after the year of the missing value); I would recommend not going beyond 3 years. Call this computed value x.
Further, to keep this computed value x close to the current data, use the average of x and y, where y is the linear interpolation value.
In the above example, y = (10 + 9)/2, i.e. the average of the value before and the value after the data point to be imputed.
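A hedged sketch of that recipe, simplified to average the same calendar day across all available years rather than only the neighbouring ones (it assumes a dataframe with DATE and VALUE columns as in the dummy example, with DATE parsed as datetimes):

import pandas as pd

s = df.set_index('DATE')['VALUE']

# y: linear interpolation between the nearest known neighbours
y = s.interpolate(method='linear')

# x: mean of the same calendar day (month, day) across the years
x = s.groupby([s.index.month, s.index.day]).transform('mean')

# impute missing values with the average of x and y, falling back to y
# alone when no other year has a value for that calendar day
imputed = s.copy()
mask = s.isna()
imputed[mask] = ((x + y) / 2).where(x.notna(), y)[mask]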
After fighting with NumPy and dateutil for days, I recently discovered the amazing Pandas library. I've been poring through the documentation and source code, but I can't figure out how to get date_range() to generate indices at the right breakpoints.
from datetime import date
import pandas as pd

start = date(2012, 1, 15)
end = date(2012, 9, 20)
# 'M' is month-end, instead I need same-day-of-month
pd.date_range(start, end, freq='M')
What I want:
2012-01-15
2012-02-15
2012-03-15
...
2012-09-15
What I get:
2012-01-31
2012-02-29
2012-03-31
...
2012-08-31
I need month-sized chunks that account for the variable number of days in a month. This is possible with dateutil.rrule:
rrule(freq=MONTHLY, dtstart=start, bymonthday=(start.day, -1), bysetpos=1)
Ugly and illegible, but it works. How can I do this with pandas? I've played with both date_range() and period_range(), so far with no luck.
My actual goal is to use groupby, crosstab and/or resample to calculate values for each period based on sums/means/etc of individual entries within the period. In other words, I want to transform data from:
total
2012-01-10 00:01 50
2012-01-15 01:01 55
2012-03-11 00:01 60
2012-04-28 00:01 80
#Hypothetical usage
dataframe.resample('total', how='sum', freq='M', start='2012-01-09', end='2012-04-15')
to
total
2012-01-09 105 # Values summed
2012-02-09 0 # Missing from dataframe
2012-03-09 60
2012-04-09 0 # Data past end date, not counted
Given that Pandas originated as a financial analysis tool, I'm virtually certain that there's a simple and fast way to do this. Help appreciated!
freq='M' is for month-end frequencies (see here). But you can use .shift to shift it by any number of days (or any frequency for that matter):
pd.date_range(start, end, freq='M').shift(15, freq=pd.datetools.day)
(pd.datetools has since been removed; in current pandas the equivalent is .shift(15, freq='D').)
There actually is no "day of month" frequency (e.g. "DOMXX" like "DOM09"), but I don't see any reason not to add one.
http://github.com/pydata/pandas/issues/2289
I don't have a simple workaround for you at the moment because resample requires passing a known frequency rule. I think it should be augmented to be able to take any date range to be used as arbitrary bin edges, also. Just a matter of time and hacking...
try
pd.date_range(start, end, freq=pd.DateOffset(months=1))
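To get from there to the per-period sums sketched in the question, one hedged possibility is to use those anchored dates as bin edges with pd.cut (the dataframe and its 'total' column are the ones from the example above):

import pandas as pd

# month-sized bin edges anchored on the 9th, as in the hypothetical usage above
edges = pd.date_range('2012-01-09', '2012-05-09', freq=pd.DateOffset(months=1))

# left-closed bins: [01-09, 02-09), [02-09, 03-09), ...
bins = pd.cut(dataframe.index, edges, right=False)
totals = dataframe.groupby(bins)['total'].sum()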