Can I get the time spans covered by the groups resulting from groupby operations without writing my own lambda function?
Currently I have the solution below, but I am wondering whether the pandas API doesn't already have this built in somehow.
To describe what I'm doing in the data prep part: my task is to find out when, and especially for how long, the boolean flag is True. I found that ndimage.label-ing is an efficient way to deal with non-contiguous data blocks. But I am open to any other cool suggestions!
import pandas as pd
from numpy.random import randn
from scipy.ndimage import label

# data preparation
idx = pd.date_range(start='now', periods=100, freq='min')
df = pd.DataFrame(randn(100), index=idx, columns=['data'])
df['mybool'] = df.data > 0
df['label'] = label(df.mybool)[0]  # numbers each contiguous run of True; False rows get 0
# my actual question:
df.groupby('label').apply(lambda x: x.index[-1] - x.index[0])
Basically, for each group I subtract the first timestamp from the last.
This results in:
label
0 01:37:00
1 00:00:00
2 00:01:00
3 00:01:00
4 00:01:00
5 00:02:00
6 00:00:00
7 00:10:00
8 00:00:00
9 00:01:00
10 00:02:00
11 00:00:00
12 00:01:00
13 00:04:00
14 00:02:00
15 00:01:00
16 00:00:00
17 00:00:00
18 00:00:00
19 00:01:00
20 00:00:00
21 00:01:00
22 00:02:00
23 00:00:00
24 00:00:00
dtype: timedelta64[ns]
To reiterate my question: does the pandas API offer a trick that does the same without applying a lambda function, or maybe even without grouping first?
Try it like this:
In [11]: df
Out[11]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 100 entries, 2013-10-25 00:45:49 to 2013-10-25 02:24:49
Freq: T
Data columns (total 3 columns):
data 100 non-null values
mybool 100 non-null values
label 100 non-null values
dtypes: bool(1), float64(1), int32(1)
In [12]: df['date'] = df.index
In [14]: g = df.groupby('label')['date']
In [15]: g.last()-g.first()
Out[15]:
label
0 01:39:00
1 00:03:00
2 00:00:00
3 00:04:00
4 00:02:00
5 00:00:00
6 00:01:00
7 00:02:00
8 00:08:00
9 00:00:00
10 00:00:00
11 00:06:00
12 00:07:00
13 00:00:00
14 00:00:00
15 00:04:00
16 00:00:00
17 00:01:00
18 00:00:00
19 00:01:00
20 00:00:00
21 00:00:00
22 00:00:00
Name: date, dtype: timedelta64[ns]
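If you'd rather skip the helper column entirely, a small variation (just a sketch, reusing df and its 'label' column from the question) is to group the index itself after converting it to a Series:
# group the timestamps directly, no extra 'date' column needed
g = df.index.to_series().groupby(df['label'])
g.max() - g.min()   # covered timespan per label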
Related
I have multiple data frames, each holding data for minutes 1 to 1440 of one day. All the dataframes are alike: same columns, same length. The time column values are in hhmm format.
Let's say df_A has the data of the 1st day, 2021-05-06. It looks like this:
>df_A
timestamp col1 col2..... col80
0
1
2
.
.
.
2359
And the next day's data is in df_B, which has the same layout. Its date is 2021-05-07:
>df_B
timestamp col1 col2..... col80
0
1
2
.
.
.
2359
How could I stack these one under another into a single dataframe, identifying each row with a column whose values have a format like YYYYMMDD HH:mm? The result would look somewhat like this:
>df
timestamp col1 col2..... col80
20210506 0000
20210506 0001
.
.
20210506 2359
20210507 0000
.
.
20210507 2359
How could I achieve this while dealing with multiple data frames at once?
df_A = pd.DataFrame(range(0, 10), columns=['timestamp'])
df_B = pd.DataFrame(range(0, 10), columns=['timestamp'])

df_A['date'] = pd.to_datetime('2021-05-06 ' +
                              df_A['timestamp'].astype(str).str.zfill(4),
                              format='%Y-%m-%d %H%M')
df_B['date'] = pd.to_datetime('2021-05-07 ' +
                              df_B['timestamp'].astype(str).str.zfill(4),
                              format='%Y-%m-%d %H%M')

df_final = pd.concat([df_A, df_B])
df_final
timestamp date
0 0 2021-05-06 00:00:00
1 1 2021-05-06 00:01:00
2 2 2021-05-06 00:02:00
3 3 2021-05-06 00:03:00
4 4 2021-05-06 00:04:00
5 5 2021-05-06 00:05:00
6 6 2021-05-06 00:06:00
7 7 2021-05-06 00:07:00
8 8 2021-05-06 00:08:00
9 9 2021-05-06 00:09:00
0 0 2021-05-07 00:00:00
1 1 2021-05-07 00:01:00
2 2 2021-05-07 00:02:00
3 3 2021-05-07 00:03:00
4 4 2021-05-07 00:04:00
5 5 2021-05-07 00:05:00
6 6 2021-05-07 00:06:00
7 7 2021-05-07 00:07:00
8 8 2021-05-07 00:08:00
9 9 2021-05-07 00:09:00
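To handle many frames at once, a minimal sketch (assuming you can pair each frame with its date string, e.g. in a dict; the mapping below is hypothetical) is to loop and concatenate:
# hypothetical mapping: date string -> that day's dataframe
frames = {'2021-05-06': df_A, '2021-05-07': df_B}

parts = []
for day, df in frames.items():
    df = df.copy()
    df['date'] = pd.to_datetime(day + ' ' + df['timestamp'].astype(str).str.zfill(4),
                                format='%Y-%m-%d %H%M')
    parts.append(df)

df_final = pd.concat(parts, ignore_index=True)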
I don't know why the resample function doesn't work properly.
I'm trying to resample with the simplest example but the function does not overwrite the original dataframe.
import pandas as pd
index = pd.date_range('1/1/2019', periods=8, freq='T')
series = pd.Series(range(8), index=index)
series.resample('3T').sum()
printing the resulting series gives me:
print(series)
>2019-01-01 00:00:00 0
>2019-01-01 00:01:00 1
>2019-01-01 00:02:00 2
>2019-01-01 00:03:00 3
>2019-01-01 00:04:00 4
>2019-01-01 00:05:00 5
>2019-01-01 00:06:00 6
>2019-01-01 00:07:00 7
>Freq: T, dtype: int64
In my use case I wanted to upsample a dataframe, but that didn't work either.
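For the record: resample, like nearly every pandas operation, returns a new object rather than modifying the original in place, so the result has to be assigned back (a minimal sketch using the example above):
# resample returns a new Series; keep it by assigning
series = series.resample('3T').sum()
print(series)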
I have a large csv file in which I want to replace values with zero in a particular time range. For example, between 20:00:00 and 05:00:00 I want to replace all values greater than zero with 0. How do I do it?
df = pd.read_csv('108e.csv')  # reading the data set
df['timeStamp'] = pd.to_datetime(df['timeStamp'])
for i in df.set_index('timeStamp').between_time('20:00:00', '05:00:00')['luminosity']:
    if i > 0:
        df[['luminosity']] = df[['luminosity']].replace({i: 0})
You can use the select function from NumPy. Note that the time window wraps midnight, so the two time conditions have to be combined with "or":
import numpy as np
t = df['timeStamp'].dt.time
cond = ((t >= pd.Timestamp('20:00').time()) | (t <= pd.Timestamp('05:00').time())) & (df['luminosity'] > 0)
df['luminosity'] = np.select([cond], [0], default=df['luminosity'])
The official NumPy documentation for select has more examples of how to use it.
Assume that your DataFrame contains:
timeStamp luminosity
0 2020-01-02 18:00:00 10
1 2020-01-02 20:00:00 11
2 2020-01-02 22:00:00 12
3 2020-01-03 02:00:00 13
4 2020-01-03 05:00:00 14
5 2020-01-03 07:00:00 15
6 2020-01-03 18:00:00 16
7 2020-01-03 20:10:00 17
8 2020-01-03 22:10:00 18
9 2020-01-04 02:10:00 19
10 2020-01-04 05:00:00 20
11 2020-01-04 05:10:00 21
12 2020-01-04 07:00:00 22
To only retrieve rows in the time range of interest you could run:
df.set_index('timeStamp').between_time('20:00', '05:00')
But if you attempted to modify these data, e.g.
df = df.set_index('timeStamp')
df.between_time('20:00', '05:00')['luminosity'] = 0
you would get a SettingWithCopyWarning and the original DataFrame would remain unchanged: between_time returns a new object, so the assignment does not propagate back to the original data.
To circumvent this limitation, you can use indexer_between_time on the index of a DataFrame. It returns a NumPy array containing the locations of the rows that meet your time-range criterion.
To update the underlying data, setting the index only to get row positions, you can run:
df.iloc[df.set_index('timeStamp').index
          .indexer_between_time('20:00', '05:00'), 1] = 0
Note that to keep the code short, I passed the integer location of the column of interest.
Access by iloc should be quite fast.
When you print the df again, the result is:
timeStamp luminosity
0 2020-01-02 18:00:00 10
1 2020-01-02 20:00:00 0
2 2020-01-02 22:00:00 0
3 2020-01-03 02:00:00 0
4 2020-01-03 05:00:00 0
5 2020-01-03 07:00:00 15
6 2020-01-03 18:00:00 16
7 2020-01-03 20:10:00 0
8 2020-01-03 22:10:00 0
9 2020-01-04 02:10:00 0
10 2020-01-04 05:00:00 0
11 2020-01-04 05:10:00 21
12 2020-01-04 07:00:00 22
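As a side note, a plain boolean mask on the time component is an alternative sketch that avoids the index round-trip (assuming the column names used above):
from datetime import time

t = df['timeStamp'].dt.time
night = (t >= time(20)) | (t <= time(5))  # window wraps midnight, hence the "or"
df.loc[night & (df['luminosity'] > 0), 'luminosity'] = 0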
I would like to get the 07:00 value every day from a multi-day DataFrame that holds 24 hours of minute data for each day.
import numpy as np
import pandas as pd
aframe = pd.DataFrame([np.arange(10000), np.arange(10000) * 2]).T
aframe.index = pd.date_range("2015-09-01", periods = 10000, freq = "1min")
aframe.head()
Out[174]:
0 1
2015-09-01 00:00:00 0 0
2015-09-01 00:01:00 1 2
2015-09-01 00:02:00 2 4
2015-09-01 00:03:00 3 6
2015-09-01 00:04:00 4 8
aframe.tail()
Out[175]:
0 1
2015-09-07 22:35:00 9995 19990
2015-09-07 22:36:00 9996 19992
2015-09-07 22:37:00 9997 19994
2015-09-07 22:38:00 9998 19996
2015-09-07 22:39:00 9999 19998
In this 10,000-row DataFrame spanning 7 days, how would I get the 7 am value each day as efficiently as possible? Assume I might have to do this for very large tick databases, so I value speed and low memory usage highly.
I know I can index with strings such as:
aframe.loc["2015-09-02 07:00:00"]
Out[176]:
0 1860
1 3720
Name: 2015-09-02 07:00:00, dtype: int64
But what I need is basically a wildcard-style query, for example:
aframe.loc["* 07:00:00"]
You can use indexer_at_time:
>>> locs = aframe.index.indexer_at_time('7:00:00')
>>> aframe.iloc[locs]
0 1
2015-09-01 07:00:00 420 840
2015-09-02 07:00:00 1860 3720
2015-09-03 07:00:00 3300 6600
2015-09-04 07:00:00 4740 9480
2015-09-05 07:00:00 6180 12360
2015-09-06 07:00:00 7620 15240
2015-09-07 07:00:00 9060 18120
There's also indexer_between_time if you need to select all indices that lie between two particular times of day.
Both of these methods return the integer locations of the desired values; the corresponding rows of the Series or DataFrame can be fetched with iloc, as shown above.
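If you don't need the integer locations themselves, at_time (a convenience method on objects with a DatetimeIndex) does the same selection in one step:
# directly select the 07:00 row of every day
aframe.at_time('07:00')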
I have a dataframe which I want to split into 5 chunks (more generally n chunks), so that I can apply a groupby on the chunks.
I want the chunks to have equal time intervals but in general each group may contain different numbers of records.
Let's call the data
s = pd.Series(pd.date_range('2012-1-1', periods=100, freq='D'))
and the time interval ti = (s.max() - s.min()) / n
So the first chunk should include all rows with dates between s.min() and s.min() + ti, the second, all rows with dates between s.min() + ti and s.min() + 2*ti, etc.
Can anyone suggest an easy way to achieve this? If I could somehow convert all my dates into seconds since the epoch, then I could do something like thisgroup = floor(thisdate / ti).
Is there an easy 'pythonic' or 'panda-ista' way to do this?
Thanks very much (and Merry Christmas!),
Robin
You can use numpy.array_split:
>>> import pandas as pd
>>> import numpy as np
>>> s = pd.Series(pd.date_range('2012-1-1', periods=10, freq='D'))
>>> np.array_split(s, 5)
[0 2012-01-01 00:00:00
1 2012-01-02 00:00:00
dtype: datetime64[ns], 2 2012-01-03 00:00:00
3 2012-01-04 00:00:00
dtype: datetime64[ns], 4 2012-01-05 00:00:00
5 2012-01-06 00:00:00
dtype: datetime64[ns], 6 2012-01-07 00:00:00
7 2012-01-08 00:00:00
dtype: datetime64[ns], 8 2012-01-09 00:00:00
9 2012-01-10 00:00:00
dtype: datetime64[ns]]
>>> np.array_split(s, 2)
[0 2012-01-01 00:00:00
1 2012-01-02 00:00:00
2 2012-01-03 00:00:00
3 2012-01-04 00:00:00
4 2012-01-05 00:00:00
dtype: datetime64[ns], 5 2012-01-06 00:00:00
6 2012-01-07 00:00:00
7 2012-01-08 00:00:00
8 2012-01-09 00:00:00
9 2012-01-10 00:00:00
dtype: datetime64[ns]]
The answer is as follows:
import numpy as np

s = pd.DataFrame(pd.date_range('2012-1-1', periods=20, freq='D'), columns=['date'])
n = 5
s['date'] = s['date'].astype('int64')  # nanoseconds since the epoch, so the arithmetic below yields plain numbers
s['bin'] = np.floor((n - 0.001) * (s['date'] - s['date'].min())
                    / (s['date'].max() - s['date'].min()))