I have two dataframes.
One lists 30-minute observations of particular values (actual and forecast) over a 24-hour period (48 observations):
> api_df.head()
from to actual forecast index
0 2019-11-24T23:30Z 2019-11-25T00:00Z 200 210 moderate
1 2019-11-25T00:00Z 2019-11-25T00:30Z 200 199 moderate
2 2019-11-25T00:30Z 2019-11-25T01:00Z 198 200 moderate
3 2019-11-25T01:00Z 2019-11-25T01:30Z 189 204 moderate
4 2019-11-25T01:30Z 2019-11-25T02:00Z 191 199 moderate
The other is observations of another value (KW) over an entire day:
> csv_extr.head()
Date Time KW
3764 2019-11-25 13:45:26.1050000 -424.437988
2911 2019-11-25 16:41:12.4040000 -465.325989
1786 2019-11-25 13:06:54.0290000 -431.795013
4352 2019-11-25 18:42:19.9360000 -452.528992
4634 2019-11-25 19:35:19.9230000 -457.210999
I want to get csv_extr to resemble api_df as closely as possible, so I decided to downsample it. I ended up with something that is almost what (I think) I'm looking for, but there are some clear issues, namely the Time values and the NaN observations:
> x.head()
Date Time KW time
0 2019-11-25 00:00:01.6470000 -100.0 0
1 NaN NaN NaN 0
2 NaN NaN NaN 1
3 2019-11-25 01:57:04.7700000 0.0 0
4 NaN NaN NaN 2
I have looked for possible reasons why, and I can only assume that these results occur because Time carries seven decimal places, which somehow keeps the output from aligning neatly to 30-minute blocks.
I achieve this final output (x) using some code that I found online, but I cannot find an explanation as to what precisely the code is doing and would like some guidance here also:
t = pd.to_timedelta(csv_extr.Time, unit = "min")
s = csv_extr.set_index(t).resample('30min').last().reset_index(drop = True)
x = s.assign(time = s.groupby("Time").cumcount())
There is a single error that appears when this code segment runs:
ValueError: only leading negative signs are allowed
As before, I have looked for what this might mean but haven't yet found anything that clearly explains it.
I am happy to provide data for reprex purposes; the reason I haven't provided anything (yet) is that I am unclear about the best way to do this in Python/pandas (some guidance here would be good too). Mostly, though, I am hoping the solution is a case of a more experienced Python user looking at the code and spotting something obvious.
You converted "Time" to a timedelta, but it looks like a timestamp to me, so I think you want pd.to_datetime, which is what I used to build the following approximation of your data. I also set the index to the new "DateTime" column and dropped the old columns:
csv_extr['DateTime'] = pd.to_datetime(csv_extr.Date + ' ' + csv_extr.Time)
csv_extr = csv_extr[['KW','DateTime']].set_index('DateTime')
KW
DateTime
2019-11-25 13:45:26.105 -424.437988
2019-11-25 16:41:12.404 -465.325989
2019-11-25 13:06:54.029 -431.795013
2019-11-25 18:42:19.936 -452.528992
2019-11-25 19:35:19.923 -457.210999
It's pretty straightforward after that. I'll show 60-minute resampling here to keep the output more compact, but it works just the same for 30-minute sampling, of course:
csv_extr.resample('60 min').last()
KW
DateTime
2019-11-25 13:00:00 -424.437988
2019-11-25 14:00:00 NaN
2019-11-25 15:00:00 NaN
2019-11-25 16:00:00 -465.325989
2019-11-25 17:00:00 NaN
2019-11-25 18:00:00 -452.528992
2019-11-25 19:00:00 -457.210999
I assume you want to fill in the missing values there. Without knowing more about your data, I'd suggest a simple linear interpolation like the following (but pandas and numpy have plenty of other options if you want something more complicated):
csv_extr.resample('60 min').last().interpolate()
KW
DateTime
2019-11-25 13:00:00 -424.437988
2019-11-25 14:00:00 -438.067322
2019-11-25 15:00:00 -451.696655
2019-11-25 16:00:00 -465.325989
2019-11-25 17:00:00 -458.927490
2019-11-25 18:00:00 -452.528992
2019-11-25 19:00:00 -457.210999
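As for sharing data for a reprex: one common approach (a suggestion, not the only way) is to post a few rows as CSV text that others can paste straight back into pandas:
import io
import pandas as pd

# Print a small sample as CSV text to copy into the question.
print(csv_extr.head(10).to_csv())

# Readers can then rebuild the sample like this (rows taken from the post):
sample = pd.read_csv(io.StringIO(
    "Date,Time,KW\n"
    "2019-11-25,13:45:26.1050000,-424.437988\n"
    "2019-11-25,16:41:12.4040000,-465.325989\n"
))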
I am new to Python.
I want to select a range of rows by using the datetime which is also the index.
I am not sure if having the datetime as the index is a problem or not.
my dataframe looks like this:
gradient
date
2022-04-15 10:00:00 0.013714
2022-04-15 10:20:00 0.140792
2022-04-15 10:40:00 0.148240
2022-04-15 11:00:00 0.016510
2022-04-15 11:20:00 0.018219
...
2022-05-02 15:40:00 0.191208
2022-05-02 16:00:00 0.016198
2022-05-02 16:20:00 0.043312
2022-05-02 16:40:00 0.500573
2022-05-02 17:00:00 0.955833
And I have made variables containing the start and end dates of the rows I want to select. They look like this:
A_start_646 = datetime.datetime(2022,4,27, 11,0,0)
S_start_646 = datetime.datetime(2022,4,28, 3,0,0)
D_start_646 = datetime.datetime(2022,5,2, 15,25,0)
D_end_646 = datetime.datetime(2022,5, 2, 15,50,0)
So I would like to make a new dataframe. I saw some examples on the internet, but they use another way of expressing the date.
Does someone know a solution?
I feel kind of stupid and smart at the same time now, because I have already answered my own question. My apologies!
So this is the answer:
new_df = data_646_mean[A_start_646 : S_start_646]
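For completeness, the same slice can be written with .loc, which is the more explicit spelling and behaves the same way on a DatetimeIndex:
# Equivalent to the plain [] slice above; both endpoints are included.
new_df = data_646_mean.loc[A_start_646 : S_start_646]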
I have a Pandas DataFrame where the index is datetimes for every 12 minutes in a day (120 rows total). I went ahead and resampled the data to every 30 minutes.
Time Rain_Rate
1 2014-04-02 00:00:00 0.50
2 2014-04-02 00:30:00 1.10
3 2014-04-02 01:00:00 0.48
4 2014-04-02 01:30:00 2.30
5 2014-04-02 02:00:00 4.10
6 2014-04-02 02:30:00 5.00
7 2014-04-02 03:00:00 3.20
I want to take 3-hour means centered on hours 00, 03, 06, 09, 12, 15, 18, and 21. I want each mean to consist of the 1.5 hours before the center (for 03:00:00, from 01:30:00) and the 1.5 hours after it (to 04:30:00). The 06:00:00 window would overlap with the 03:00:00 average (they would both use 04:30:00).
Is there a way to do this using pandas? I've tried a few things but they haven't worked.
Method 1
I'm going to suggest just changing your resample from the get-go to get the chunks you want. Here's some fake data resembling yours, before any resampling:
dr = pd.date_range('04-02-2014 00:00:00', '04-03-2014 00:00:00', freq='12T', closed='left')
data = np.random.rand(120)
df = pd.DataFrame(data, index=dr, columns=['Rain_Rate'])
df.index.name = 'Time'
#df.head()
Rain_Rate
Time
2014-04-02 00:00:00 0.616588
2014-04-02 00:12:00 0.201390
2014-04-02 00:24:00 0.802754
2014-04-02 00:36:00 0.712743
2014-04-02 00:48:00 0.711766
Averaging in 3-hour chunks initially will be the same as doing 30-minute chunks and then 3-hour chunks. You just have to tweak a couple of things to get the bins you want. First you can add the bin you will start from (i.e. 10:30 pm on the previous day, even if there's no data there; the first bin runs from 10:30 pm to 1:30 am), then resample starting from this point:
before = df.index[0] - pd.Timedelta(minutes=90) #only if the first index is at midnight!!!
df.loc[before] = np.nan
df = df.sort_index()
output = df.resample('3H', base=22.5, loffset='90min').mean()
The base parameter here means start the bins at the 22.5th hour (10:30 pm), and loffset means push the bin labels forward by 90 minutes. You get the following output:
Rain_Rate
Time
2014-04-02 00:00:00 0.555515
2014-04-02 03:00:00 0.546571
2014-04-02 06:00:00 0.439953
2014-04-02 09:00:00 0.460898
2014-04-02 12:00:00 0.506690
2014-04-02 15:00:00 0.605775
2014-04-02 18:00:00 0.448838
2014-04-02 21:00:00 0.387380
2014-04-03 00:00:00 0.604204 #this is the bin at midnight on the following day
You could also start with the data binned at 30 minutes and use this method, and should get the same answer.*
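As a side note for newer pandas: base and loffset were deprecated in pandas 1.1 and removed in 2.0. On a current version, a sketch of the equivalent call (not verified against the exact output above) shifts the bin edges with offset and then shifts the labels by hand:
# pandas >= 2.0 sketch: offset moves the bin edges back 90 minutes
# (to 22:30, 01:30, ...), then the labels are shifted forward by 90
# minutes to reproduce loffset='90min'.
output = df.resample('3h', offset='-90min').mean()
output.index = output.index + pd.Timedelta(minutes=90)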
Method 2
Another approach would be to find the locations of the indexes you want to create averages for, and then calculate the average of the entries in the 3 hours surrounding each one:
resampled = df.resample('30T').mean()  # binned at 30 minutes, like the data in the post
centers = [0, 3, 6, 9, 12, 15, 18, 21]
mask = resampled.index.hour.isin(centers) & (resampled.index.minute == 0)
df_centers = resampled.index[mask]
output = []
for center in df_centers:
    cond1 = resampled.index >= (center - pd.Timedelta(hours=1.5))
    cond2 = resampled.index <= (center + pd.Timedelta(hours=1.5))
    output.append(resampled[cond1 & cond2].values.mean())
Output here is the same, but the answers are in a list (and the last point of "24 hours" is not included):
[0.5555146139562004,
0.5465709237162698,
0.43995277270996735,
0.46089800625663596,
0.5066902552121085,
0.6057747262752732,
0.44883794039466535,
0.3873795731806939]
*You mentioned you wanted some points on the edge of bins to be included in both bins. resample doesn't do this (and generally I don't think most people want to), but the second method is explicit about doing so (by using >= and <= in cond1 and cond2). These two methods nevertheless achieve the same result here, presumably because using resample at different stages causes data points to land in different bins. It's hard for me to wrap my head around that, but one could do a little manual binning to verify what is going on. The point is, I would recommend spot-checking the output of these methods (or any resample-based method) against your raw data to make sure things look correct. For these examples, I did so using Excel.
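In pandas, a minimal version of that spot-check for a single bin might look like this (a sketch using the names from Method 1; note that .loc slicing on a DatetimeIndex includes both endpoints, which is exactly the edge case discussed above):
# Mean of the raw 12-minute values in the 3-hour window around one
# center; compare with the 03:00 row of Method 1's output.
center = pd.Timestamp('2014-04-02 03:00:00')
half = pd.Timedelta(hours=1.5)
print(df.loc[center - half : center + half, 'Rain_Rate'].mean())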
Dear experienced community,
I can't find an elegant solution to my problem.
I have a subsample of my dataset which I want to resample weekly, but starting some weeks before the first entry in my data frame (so there are a few weeks with 0 counts).
A sample of the data:
In:
print(df_pec.head())
Out:
Count Image_Sequence_DateTime
18 1 2015-11-06 03:22:19
21 1 2015-11-11 01:48:51
22 1 2015-11-11 07:30:47
37 1 2015-11-25 09:42:23
48 1 2015-12-05 12:12:34
With the earliest image sequence at:
In:
df_pec.Image_Sequence_DateTime.min()
Out:
2015-09-30 15:16:38
I have another function that gives me the starting point of the first week and the last point of the last week ever measured in that experiment, which are:
In:
print(s_startend)
Out:
Start 2015-09-28
End 2017-12-25
dtype: datetime64[ns]
My problem is that I want to resample df_pec weekly, but starting on the very first second of the very first day of the very first week of the experimental deployment (using s_startend as the reference).
I try:
df_pec=df_pec.resample('1W', on='Image_Sequence_DateTime').sum()
print(df_pec.head(),'\n',df_pec.tail())
Out:
Count
Image_Sequence_DateTime
2015-10-04 26.0
2015-10-11 92.0
2015-10-18 204.0
2015-10-25 193.0
2015-11-01 187.0
Count
Image_Sequence_DateTime
2017-11-19 20.0
2017-11-26 34.0
2017-12-03 16.0
2017-12-10 11.0
2017-12-17 3.0
This is pretty weird because it even seems to skip the first days of data in df_pec (starting 2015-09-30 15:16:38).
And even if it worked, I have no way of telling the resampling to start and end at specified values (s_startend from my example), even when there are no records in the earliest and latest weeks of my subsample df_pec.
I thought about artificially adding two entries to df_pec with the real start and real end, but that doesn't feel elegant and I don't want to add meaningless keys to my df; see the reindex sketch below for the shape of what I mean.
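Something along these lines (a sketch, untested):
# Resample as before, then reindex against the full weekly range built
# from s_startend so that empty weeks show up as 0 counts.
weekly = df_pec.resample('1W', on='Image_Sequence_DateTime').sum()
full_weeks = pd.date_range(s_startend['Start'], s_startend['End'], freq='1W')
weekly = weekly.reindex(full_weeks, fill_value=0)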
Thank you very much for your wisdom!
Is it somehow possible to use resample on irregularly spaced data? (I know that the documentation says it's for "resampling of regular time-series data", but I wanted to try if it works on irregular data, too. Maybe it doesn't, or maybe I am doing something wrong.)
In my real data, I have generally 2 samples per hour, the time difference between them ranging usually from 20 to 40 minutes. So I was hoping to resample them to a regular hourly series.
To test whether I am using it right, I used a random list of dates I already had, so it may not be the best example, but at least a solution that works for it should be very robust. Here it is:
fraction number time
0 0.729797 0 2014-10-23 15:44:00
1 0.141084 1 2014-10-30 19:10:00
2 0.226900 2 2014-11-05 21:30:00
3 0.960937 3 2014-11-07 05:50:00
4 0.452835 4 2014-11-12 12:20:00
5 0.578495 5 2014-11-13 13:57:00
6 0.352142 6 2014-11-15 05:00:00
7 0.104814 7 2014-11-18 07:50:00
8 0.345633 8 2014-11-19 13:37:00
9 0.498004 9 2014-11-19 22:47:00
10 0.131665 10 2014-11-24 15:28:00
11 0.654018 11 2014-11-26 10:00:00
12 0.886092 12 2014-12-04 06:37:00
13 0.839767 13 2014-12-09 00:50:00
14 0.257997 14 2014-12-09 02:00:00
15 0.526350 15 2014-12-09 02:33:00
Now I want to resample these for example monthly:
df_new = df.set_index(pd.DatetimeIndex(df['time']))
df_new['fraction'] = df.fraction.resample('M',how='mean')
df_new['number'] = df.number.resample('M',how='mean')
But I get TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex' - unless I did something wrong with assigning the datetime index, it must be due to the irregularity?
So my questions are:
Am I using it correctly?
If 1==True, is there no straightforward way to resample the data?
(I only see a solution in first reindexing the data to get finer intervals, interpolate the values in between and then reindexing it to hourly interval. If it is so, then a question regarding the correct implementation of reindex will follow shortly.)
You don't need to explicitly use DatetimeIndex; just set 'time' as the index and pandas will take care of the rest, so long as your 'time' column has been converted to datetime using pd.to_datetime or some other method. (Incidentally, the TypeError comes from resampling the original df, which still has its plain RangeIndex, rather than df_new.) Additionally, you don't need to resample each column individually if you're using the same method; just do it on the entire DataFrame.
# Convert to datetime, if necessary.
df['time'] = pd.to_datetime(df['time'])
# Set the index and resample (using month start freq for compact output).
df = df.set_index('time')
df = df.resample('MS').mean()
The resulting output:
fraction number
time
2014-10-01 0.435441 0.5
2014-11-01 0.430544 6.5
2014-12-01 0.627552 13.5
I am trying to use pandas to resample vessel tracking data from seconds to minutes using how='first'. The dataframe is called hg1s. The unique ID is called MMSI. The datetime index is TX_DTTM. Here is a data sample:
TX_DTTM MMSI LAT LON NS
2013-10-01 00:00:02 367542760 29.660550 -94.974195 15
2013-10-01 00:00:04 367542760 29.660550 -94.974195 15
2013-10-01 00:00:07 367451120 29.614161 -94.954459 0
2013-10-01 00:00:15 367542760 29.660210 -94.974069 15
2013-10-01 00:00:13 367542760 29.660210 -94.974069 15
The code to resample:
hg1s1min = hg1s.groupby('MMSI').resample('1Min', how='first')
And a data sample of the output:
hg1s1min[20000:20004]
MMSI TX_DTTM NS LAT LON
367448060 2013-10-21 00:42:00 NaN NaN NaN
2013-10-21 00:43:00 NaN NaN NaN
2013-10-21 00:44:00 NaN NaN NaN
2013-10-21 00:45:00 NaN NaN NaN
It's safe to assume that there are several data points within each minute, so I don't understand why this isn't picking up the first record with that method. I looked at this link: Pandas Downsampling Issue, because it seemed similar to my problem. I tried passing label='left' and label='right'; neither worked.
How do I return the first record in every minute for each MMSI?
As it turns out, the problem isn't with the method but with my assumption about the data. The large data set covers a month, or 44640 minutes. While every record in my dataset has the relevant values, there isn't 100% overlap in time. In this case, MMSI = 367448060 is present at 2013-10-17 23:24:31 and again at 2013-10-29 20:57:32. Between those two data points there is no data to sample, resulting in NaN, which is correct.
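A small follow-up for anyone who wants to hide those gap minutes rather than keep them (a suggestion on top of the above, written with the method chaining that replaced how='first' in later pandas):
# Take the first record per minute per vessel, then drop the minutes in
# which a vessel has no data at all.
hg1s1min = hg1s.groupby('MMSI').resample('1Min').first()
hg1s1min = hg1s1min.dropna(how='all')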