Pandas Dataframe.resample() does not overwrite DataFrame - python

I don't know why the resample function doesn't work properly.
I'm trying to resample with the simplest example but the function does not overwrite the original dataframe.
import pandas as pd
index = pd.date_range('1/1/2019', periods=8, freq='T')
series = pd.Series(range(8), index=index)
series.resample('3T').sum()
printing the resulting series gives me:
print(series)
>2019-01-01 00:00:00 0
>2019-01-01 00:01:00 1
>2019-01-01 00:02:00 2
>2019-01-01 00:03:00 3
>2019-01-01 00:04:00 4
>2019-01-01 00:05:00 5
>2019-01-01 00:06:00 6
>2019-01-01 00:07:00 7
>Freq: T, dtype: int64
In my use case I wanted to upsample a dataframe, but that didn't work either.
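For reference, resample() does not modify the object it is called on; it returns a new object, so the result has to be assigned to a name. A minimal sketch of that, reusing the series from the question (the names `resampled` and `upsampled` and the choice of forward-filling when upsampling are assumptions for illustration):

```python
import pandas as pd

# 'min' is the spelling of the minute alias 'T' that newer pandas releases prefer
index = pd.date_range("1/1/2019", periods=8, freq="min")
series = pd.Series(range(8), index=index)

# Downsample: the result is a new Series, the original is left untouched
resampled = series.resample("3min").sum()

# Upsample: there is nothing to aggregate, so pick a fill strategy instead
upsampled = series.resample("30s").ffill()

print(resampled)
print(len(series), len(upsampled))
```

To "overwrite" the original, assign the result back to the same name, e.g. `series = series.resample("3min").sum()`.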

Related

How to append hour:min:sec to the DateTime in pandas Dataframe

I have following dataframe, where date was set as the index col,
date        renormalized
2017-01-01  6
2017-01-08  5
2017-01-15  3
2017-01-22  3
2017-01-29  3
I want to append 00:00:00 to each of the datetime in the index column, make it like
date                 renormalized
2017-01-01 00:00:00  6
2017-01-08 00:00:00  5
2017-01-15 00:00:00  3
2017-01-22 00:00:00  3
2017-01-29 00:00:00  3
I am stuck and have not found a way to do this. It would be great if anyone could help.
Thanks
AL
When your time is 0 for all instances, pandas doesn't show the time by default (although it's a Timestamp class, so it has the time!). Probably your data is already normalized, and you can perform delta time operations as usual.
You can see a target observation with df.index[0] for instance, or take a look at all the times with df.index.time.
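A small sketch of that check, assuming a frame shaped like the one in the question (the construction of `df` here is made up for illustration):

```python
import datetime

import pandas as pd

dates = pd.to_datetime(
    ["2017-01-01", "2017-01-08", "2017-01-15", "2017-01-22", "2017-01-29"]
)
df = pd.DataFrame({"renormalized": [6, 5, 3, 3, 3]}, index=dates)
df.index.name = "date"

# A single entry is a full Timestamp, time component included
first = df.index[0]
print(repr(first))

# The time-of-day of every entry; all midnight here
times = df.index.time
print(times)
```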
You can use DatetimeIndex.strftime, which converts the index to formatted strings:
df.index = pd.to_datetime(df.index).strftime('%Y-%m-%d %H:%M:%S')
print(df)
renormalized
date
2017-01-01 00:00:00 6
2017-01-08 00:00:00 5
2017-01-15 00:00:00 3
2017-01-22 00:00:00 3
2017-01-29 00:00:00 3
Or, if the index already holds strings rather than datetimes, you can append the time directly:
df.index = df.index + ' 00:00:00'
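Both of these approaches leave the index holding text, not datetimes, which matters if date arithmetic or resampling is needed later. A short sketch of the round trip (the two sample dates and the names `as_text` and `back` are assumptions for illustration):

```python
import pandas as pd

idx = pd.to_datetime(["2017-01-01", "2017-01-08"])

# strftime returns an Index of plain strings
as_text = idx.strftime("%Y-%m-%d %H:%M:%S")
print(as_text.tolist())

# Converting back restores real timestamps when they are needed again
back = pd.to_datetime(as_text)
print(back[1] - back[0])
```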

Remove data timestamp and get data only every hours python

I have a bunch of timestamp data in a csv file like this:
2012-01-01 00:00:00, data
2012-01-01 00:01:00, data
2012-01-01 00:02:00, data
...
2012-01-01 00:59:00, data
2012-01-01 01:00:00, data
2012-01-01 01:01:00, data
I want to drop the per-minute rows and keep only one row per hour in python, like the following:
2012-01-01 00:00:00, data
2012-01-01 01:00:00, data
2012-01-01 02:00:00, data
Could anyone help me? Thank you.
I believe you need pandas resample; here is an example of how it produces the output you want. Keep in mind that resampling converts between frequencies, so you must say how the values that fall into each new interval should be combined (summing them, averaging them, taking a difference, etc.); otherwise you just get a DatetimeIndexResampler object back. Here is an example:
import pandas as pd
index = pd.date_range('1/1/2000', periods=9, freq='40T')
series = pd.Series(range(9),index=index)
print(series)
Output:
2000-01-01 00:00:00 0
2000-01-01 00:40:00 1
2000-01-01 01:20:00 2
2000-01-01 02:00:00 3
2000-01-01 02:40:00 4
2000-01-01 03:20:00 5
2000-01-01 04:00:00 6
2000-01-01 04:40:00 7
2000-01-01 05:20:00 8
Applying resample hourly without passing the aggregation function:
print(series.resample('H'))
Output:
DatetimeIndexResampler [freq=<Hour>, axis=0, closed=left, label=left, convention=start, base=0]
After passing .sum():
print(series.resample('H').sum())
Output:
2000-01-01 00:00:00 1
2000-01-01 01:00:00 2
2000-01-01 02:00:00 7
2000-01-01 03:00:00 5
2000-01-01 04:00:00 13
2000-01-01 05:00:00 8
Freq: H, dtype: int64
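If the goal is only to keep the rows that fall on the hour, without combining values, one possible sketch is to take the first row of each hourly bin, or to filter on the minute component of the index. The per-minute data below is made up for illustration:

```python
import pandas as pd

# Three hours of per-minute data standing in for the CSV contents
index = pd.date_range("2012-01-01", periods=180, freq="min")
series = pd.Series(range(180), index=index)

# Option 1: first value inside each hourly bin
# ('h' is the lowercase spelling of the hourly alias 'H' used above)
hourly_first = series.resample("h").first()

# Option 2: keep only rows whose timestamp has minute == 0
on_the_hour = series[series.index.minute == 0]

print(hourly_first)
print(on_the_hour)
```

The two options agree here because every hour has a row at minute 0; if some of those rows are missing, option 1 falls back to the earliest row in the hour, while option 2 simply has no row for it.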

Problems resampling pandas time series with time gap between indices

I want to resample the pandas series
import pandas as pd
index_1 = pd.date_range('1/1/2000', periods=4, freq='T')
index_2 = pd.date_range('1/2/2000', periods=3, freq='T')
series = pd.Series(range(4), index=index_1)
series = pd.concat([series, pd.Series(range(3), index=index_2)])  # Series.append was removed in pandas 2.0
print(series)
>>>2000-01-01 00:00:00 0
2000-01-01 00:01:00 1
2000-01-01 00:02:00 2
2000-01-01 00:03:00 3
2000-01-02 00:00:00 0
2000-01-02 00:01:00 1
2000-01-02 00:02:00 2
such that the resulting Series only contains every second entry, i.e.
>>>2000-01-01 00:00:00 0
2000-01-01 00:02:00 2
2000-01-02 00:00:00 0
2000-01-02 00:02:00 2
using the (poorly documented) resample method of pandas in the following way:
resampled_series = series.resample('2T', closed='right').mean()  # older pandas averaged implicitly
print(resampled_series)
I get
>>>1999-12-31 23:58:00 0.0
2000-01-01 00:00:00 1.5
2000-01-01 00:02:00 3.0
2000-01-01 00:04:00 NaN
2000-01-01 00:56:00 NaN
...
2000-01-01 23:54:00 NaN
2000-01-01 23:56:00 NaN
2000-01-01 23:58:00 0.0
2000-01-02 00:00:00 1.5
2000-01-02 00:02:00 3.0
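Why does it start 2 minutes earlier than the original series? Why does it contain all the time steps in between, which are not contained in the original series? How can I get my desired result?
resample() is not the right function for your purpose. It bins the whole time range at a regular frequency, so the gap between the two days is filled with empty bins, and with closed='right' the first timestamp falls at the right edge of a bin that is labelled by its left edge, two minutes earlier.
Try this: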
series[series.index.minute % 2 == 0]
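A self-contained sketch of that filter on the series from the question, together with a positional variant that keeps every second row within each day regardless of the minute value. The variable names and the per-day `cumcount` variant are assumptions for illustration, not part of the original answer:

```python
import pandas as pd

index_1 = pd.date_range("1/1/2000", periods=4, freq="min")
index_2 = pd.date_range("1/2/2000", periods=3, freq="min")
series = pd.concat(
    [pd.Series(range(4), index=index_1), pd.Series(range(3), index=index_2)]
)

# Keep rows whose timestamp falls on an even minute
even_minutes = series[series.index.minute % 2 == 0]

# Alternative: number the rows within each calendar day and keep positions 0, 2, ...
position_in_day = series.groupby(series.index.normalize()).cumcount()
by_position = series[position_in_day % 2 == 0]

print(even_minutes)
print(by_position)
```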

Split dataframe into n equal time intervals, to groupby, where time interval is (time.max() - time.min())/ n

I have a dataframe which I want to split into 5 chunks (more generally n chunks), so that I can apply a groupby on the chunks.
I want the chunks to have equal time intervals but in general each group may contain different numbers of records.
Let's call the data
s = pd.Series(pd.date_range('2012-1-1', periods=100, freq='D'))
and the timeinterval ti = (s.max() - s.min())/n
So the first chunk should include all rows with dates between s.min() and s.min() + ti, the second, all rows with dates between s.min() + ti and s.min() + 2*ti, etc.
Can anyone suggest an easy way to achieve this? If somehow I could convert all my dates into seconds since the epoch, then I could do something like thisgroup = floor(thisdate/ti).
Is there an easy 'pythonic' or 'panda-ista' way to do this?
Thanks very much (and Merry Christmas!),
Robin
You can use numpy.array_split (note that this gives chunks with equal numbers of rows, not equal time spans):
>>> import pandas as pd
>>> import numpy as np
>>> s = pd.Series(pd.date_range('2012-1-1', periods=10, freq='D'))
>>> np.array_split(s, 5)
[0 2012-01-01 00:00:00
1 2012-01-02 00:00:00
dtype: datetime64[ns], 2 2012-01-03 00:00:00
3 2012-01-04 00:00:00
dtype: datetime64[ns], 4 2012-01-05 00:00:00
5 2012-01-06 00:00:00
dtype: datetime64[ns], 6 2012-01-07 00:00:00
7 2012-01-08 00:00:00
dtype: datetime64[ns], 8 2012-01-09 00:00:00
9 2012-01-10 00:00:00
dtype: datetime64[ns]]
>>> np.array_split(s, 2)
[0 2012-01-01 00:00:00
1 2012-01-02 00:00:00
2 2012-01-03 00:00:00
3 2012-01-04 00:00:00
4 2012-01-05 00:00:00
dtype: datetime64[ns], 5 2012-01-06 00:00:00
6 2012-01-07 00:00:00
7 2012-01-08 00:00:00
8 2012-01-09 00:00:00
9 2012-01-10 00:00:00
dtype: datetime64[ns]]
The answer is as follows:
import numpy as np
import pandas as pd
s = pd.DataFrame(pd.date_range('2012-1-1', periods=20, freq='D'), columns=["date"])
n = 5
# Integer view of the timestamps so they can be scaled and divided
t = s["date"].astype("int64")
# Scale to [0, n) and floor; the 0.001 keeps the latest date inside the last bin (n-1)
s["bin"] = np.floor((n - 0.001) * (t - t.min()) / (t.max() - t.min()))
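The same idea can also be written without converting to integers, since subtracting timestamps gives timedeltas and dividing two timedeltas gives a plain number. A minimal sketch following the question's `floor(thisdate/ti)` formulation; the names `chunk` and `counts` are assumptions, and clipping is used here to put the latest date into the last chunk:

```python
import numpy as np
import pandas as pd

n = 5
s = pd.Series(pd.date_range("2012-1-1", periods=100, freq="D"))

# Width of one chunk as a Timedelta
ti = (s.max() - s.min()) / n

# Timedelta / Timedelta is a float, so flooring gives the chunk number;
# the maximum date would land in chunk n, so clip it back to n - 1
chunk = np.floor((s - s.min()) / ti).clip(upper=n - 1).astype(int)

# The chunk numbers can be passed straight to groupby
counts = s.groupby(chunk).count()
print(counts)
```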

Timespan of groups in Pandas timeseries

Can I get the time spans covered by the groups resulting from a groupby operation without using my own lambda function?
Currently I have the solution below, but I am wondering whether the pandas API already has this built in somehow.
To describe what I'm doing in the data prep part: My task is to find out when and especially for how long the boolean flag is True. I found that ndimage.label-ing is an efficient way to deal with non-contiguous data blocks. But I am open to any other cool suggestions!
import numpy as np
import pandas as pd
from scipy.ndimage import label
# data preparation
idx = pd.date_range(start='now', periods = 100, freq='min')
df= pd.DataFrame(np.random.randn(100), index=idx, columns=['data'])
df['mybool'] = df.data > 0
df['label'] = label(df.mybool)[0]
# my actual question:
df.groupby('label').apply(lambda x:x.index[-1] - x.index[0])
Basically, I subtract the first timestamp from the last for each group.
This results in:
label
0 01:37:00
1 00:00:00
2 00:01:00
3 00:01:00
4 00:01:00
5 00:02:00
6 00:00:00
7 00:10:00
8 00:00:00
9 00:01:00
10 00:02:00
11 00:00:00
12 00:01:00
13 00:04:00
14 00:02:00
15 00:01:00
16 00:00:00
17 00:00:00
18 00:00:00
19 00:01:00
20 00:00:00
21 00:01:00
22 00:02:00
23 00:00:00
24 00:00:00
dtype: timedelta64[ns]
To reiterate my question: Does the pandas API offer a trick that does the same without applying a lambda function, or maybe even without grouping first?
Try this:
In [11]: df
Out[11]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 100 entries, 2013-10-25 00:45:49 to 2013-10-25 02:24:49
Freq: T
Data columns (total 3 columns):
data 100 non-null values
mybool 100 non-null values
label 100 non-null values
dtypes: bool(1), float64(1), int32(1)
In [12]: df['date'] = df.index
In [14]: g = df.groupby('label')['date']
In [15]: g.last()-g.first()
Out[15]:
label
0 01:39:00
1 00:03:00
2 00:00:00
3 00:04:00
4 00:02:00
5 00:00:00
6 00:01:00
7 00:02:00
8 00:08:00
9 00:00:00
10 00:00:00
11 00:06:00
12 00:07:00
13 00:00:00
14 00:00:00
15 00:04:00
16 00:00:00
17 00:01:00
18 00:00:00
19 00:01:00
20 00:00:00
21 00:00:00
22 00:00:00
Name: date, dtype: timedelta64[ns]
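A compact, deterministic sketch of the same first/last idea, which also labels the runs with pandas alone instead of scipy. The small boolean series and the names `run_id` and `spans` are assumptions made up for illustration:

```python
import pandas as pd

idx = pd.date_range("2013-10-25", periods=7, freq="min")
flag = pd.Series([True, True, False, True, False, False, True], index=idx)

# A new run starts wherever the flag differs from the previous row;
# the cumulative sum of those change points numbers the runs
run_id = flag.ne(flag.shift()).cumsum()

# Timestamps as values, restricted to rows where the flag is True
stamps = flag.index.to_series()[flag]

# Span of each True run: last timestamp minus first timestamp
g = stamps.groupby(run_id[flag])
spans = g.max() - g.min()
print(spans)
```

A run consisting of a single row has a span of zero, which matches the 00:00:00 entries in the output above.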
