I have been spinning my wheels with this problem and was wondering if anyone has any insight on how best to approach it. I have a pandas DataFrame with a number of columns, including one of dtype datetime64[ns]. I would like to find some way to 'group' records together whose datetimes are very close to one another. For example, I might be interested in grouping the following transactions together if they occur within two seconds of each other, by assigning a common ID called Grouped ID:
Transaction ID Time Grouped ID
1 08:10:02 1
2 08:10:03 1
3 08:10:50
4 08:10:55
5 08:11:00 2
6 08:11:01 2
7 08:11:02 2
8 08:11:03 3
9 08:11:04 3
10 08:15:00
Note that I am not looking to have the time window expand ad infinitum if transactions continue to occur at quick intervals - once a full 2 second window has passed, a new window begins with the next transaction (as shown in transactions 5 - 9). Additionally, I will ultimately be performing this analysis at the millisecond level (i.e. combining transactions within 50 ms), but I have stuck with seconds above for ease of presentation.
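In other words, what I am after is roughly the sketch below (the function name is just illustrative and, unlike the table above, it also labels isolated transactions):
import pandas as pd

def fixed_window_groups(times, window):
    """Assign group ids so that a new group starts whenever a transaction falls
    more than `window` after the first transaction of the current group."""
    times = times.sort_values()
    ids = []
    anchor = None
    current = 0
    for t in times:
        if anchor is None or t - anchor > window:
            anchor = t          # this transaction opens a new window
            current += 1
        ids.append(current)
    # index-aligned, so it can be assigned back as a column regardless of row order
    return pd.Series(ids, index=times.index)

# illustrative usage, assuming a datetime64[ns] column named "Time":
# df["Grouped ID"] = fixed_window_groups(df["Time"], pd.Timedelta(seconds=2))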
Thanks very much for any insight you can offer!
The solution I suggest requires you to reindex your data with your Time data.
You can build a list of datetimes with the desired frequency, use searchsorted to find the nearest positions in your index, and then use those positions for slicing (as suggested in the questions python pandas dataframe slicing by date conditions and Python pandas, how to truncate DatetimeIndex and fill missing data only in certain interval).
I'm using pandas 0.14.1 and the DateOffset object (http://pandas.pydata.org/pandas-docs/dev/timeseries.html?highlight=dateoffset). I didn't check with datetime64, but I guess you might adapt the code. DateOffset goes down to the microsecond level.
Using the following code,
import pandas as pd
import pandas.tseries.offsets as pto
import numpy as np
# Create some test data: one row per millisecond
d_size = 15
df = pd.DataFrame({"value": np.arange(d_size)},
                  index=pd.date_range("2014/11/03", periods=d_size, freq=pto.Milli()))
# Define the tick datetimes that delimit the groups (every 5 ms)
ticks = pd.date_range("2014/11/03", periods=d_size // 3, freq=5 * pto.Milli())
# Find the positions in the index nearest to each tick
index_ticks = np.unique(df.index.searchsorted(ticks))
# Make a dataframe that will hold the group ids
dgroups = pd.DataFrame(index=df.index, columns=['Group id'])
# Set the group ids, one per slice between consecutive ticks
for i, (mini, maxi) in enumerate(zip(index_ticks[:-1], index_ticks[1:])):
    dgroups.iloc[mini:maxi] = i
# Update the original dataframe
df['Group id'] = dgroups['Group id']
I was able to obtain this kind of dataframe:
value Group id
2014-11-03 00:00:00 0 0
2014-11-03 00:00:00.001000 1 0
2014-11-03 00:00:00.002000 2 0
2014-11-03 00:00:00.003000 3 0
2014-11-03 00:00:00.004000 4 0
2014-11-03 00:00:00.005000 5 1
2014-11-03 00:00:00.006000 6 1
2014-11-03 00:00:00.007000 7 1
2014-11-03 00:00:00.008000 8 1
2014-11-03 00:00:00.009000 9 1
2014-11-03 00:00:00.010000 10 2
2014-11-03 00:00:00.011000 11 2
2014-11-03 00:00:00.012000 12 2
2014-11-03 00:00:00.013000 13 2
2014-11-03 00:00:00.014000 14 2
I have a df that looks like this; it contains frequencies recorded at specific times and places.
Time Latitude Longitude frequency
0 2022-07-07 00:47:49 31.404463 73.117654 -88.599998
1 2022-07-09 00:13:13 31.442087 73.051086 -88.400002
2 2022-07-13 14:25:45 31.433669 73.118194 -87.500000
3 2022-07-13 17:50:53 31.411087 73.094298 -90.199997
4 2022-07-13 17:50:55 31.411278 73.094554 -89.000000
5 2022-07-14 10:49:13 31.395443 73.108911 -88.000000
6 2022-07-14 10:49:15 31.395436 73.108902 -87.699997
7 2022-07-14 10:49:19 31.395379 73.108847 -87.300003
8 2022-07-14 10:50:29 31.393905 73.107315 -88.000000
9 2022-07-14 10:50:31 31.393879 73.107283 -89.000000
10 2022-07-14 10:50:33 31.393858 73.107265 -89.800003
I want to group all the rows which are just 2 seconds apart (for example, the 3 rows at index 5-7 have a time difference of just 2 seconds). Similarly, indexes 8-10 also have the same difference, and I want to place them in a separate group and keep only these unique groups.
So far I have tried this:
df.groupby([pd.Grouper(key='Time', freq='25S')]).frequency.count()
It helps a little, but I have to manually insert a time duration in which to look for close timestamp records. Still, in my case, I don't have specific time intervals, as there can be 50 or more consecutive rows with a gap of 2 seconds in-between over the next two minutes. I just want to keep all these rows in a unique group.
My solution is to create a column Group which groups the rows for which the time difference is small.
First sort the column Time (if necessary): df = df.sort_values('Time').
Now create the groups:
n = 2  # maximum gap (in seconds) allowed within a group
df['Group'] = df.Time.diff().dt.total_seconds().gt(n).cumsum()  # gaps > n start a new group; cumsum turns the flags into ids
Now you can do
df.groupby('Group').frequency.count()
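If, as described in the question, only one row per group should be kept, one possible follow-up is the sketch below (the choice of the first time and the mean of the other columns is an assumption, and named aggregation requires pandas >= 0.25):
groups = (
    df.sort_values('Time')
      .groupby('Group')
      .agg(
          Time=('Time', 'first'),
          Latitude=('Latitude', 'mean'),
          Longitude=('Longitude', 'mean'),
          frequency=('frequency', 'mean'),
          rows=('frequency', 'size'),
      )
      .reset_index()
)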
I have a Pandas DataFrame with the following data:
df1[['interval','answer']]
interval answer
0 0 days 06:19:17.767000 no
1 0 days 00:26:35.867000 no
2 0 days 00:29:12.562000 no
3 0 days 01:04:36.362000 no
4 0 days 00:04:28.746000 yes
5 0 days 02:56:56.644000 yes
6 0 days 00:20:13.600000 no
7 0 days 02:31:17.836000 no
8 0 days 02:33:44.575000 no
9 0 days 00:08:08.785000 no
10 0 days 03:48:48.183000 no
11 0 days 00:22:19.327000 no
12 0 days 00:05:05.253000 question
13 0 days 01:08:01.338000 unsubscribe
14 0 days 15:10:30.503000 no
15 0 days 11:09:05.824000 no
16 1 days 12:56:07.526000 no
17 0 days 18:10:13.593000 no
18 0 days 02:25:56.299000 no
19 2 days 03:54:57.715000 no
20 0 days 10:11:28.478000 no
21 0 days 01:04:55.025000 yes
22 0 days 13:59:40.622000 yes
The format of the df is:
id object
datum datetime64[ns]
datum2 datetime64[ns]
answer object
interval timedelta64[ns]
dtype: object
As a result, the boxplot looks like this (image omitted).
Any idea?
Any help is appreciated...
Robert
Seaborn may help you achieve what you want.
First of all, one needs to make sure the columns are of the type one wants.
In order to recreate your problem, I created the same dataframe (and gave it the same name, df1). Here one can see the data types of the columns:
[In]: df1.dtypes
[Out]:
interval object
answer object
dtype: object
For the column "answers", one can use pandas.factorize as follows
df1['NewAnswer'] = pd.factorize(df1['answer'])[0] + 1
That will create a new column and assign the values 1 to No, 2 to Yes, 3 to Question, and 4 to Unsubscribe.
With this, one can already create a box plot using sns.boxplot:
ax = sns.boxplot(x="interval", y="NewAnswer", hue="answer", data=df1)
Which results in the following plot (image omitted).
There are many possible combinations one could try, so I will leave it at these, as the OP didn't specify their requirements nor give an example of the expected output.
Notes:
Make sure you have the required libraries installed.
There may be other visualizations that would work better with this dataframe; the seaborn example gallery is a good place to look for examples.
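For completeness, here is a self-contained sketch of the steps above (this is a variation, not the exact call shown earlier: it converts the timedelta to hours so the boxplot gets a numeric axis; seaborn and matplotlib are assumed to be installed):
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Recreate a small version of df1 with the two columns from the question
df1 = pd.DataFrame({
    "interval": pd.to_timedelta(["06:19:17.767", "00:26:35.867", "00:04:28.746", "02:56:56.644"]),
    "answer": ["no", "no", "yes", "yes"],
})

# Encode the answers numerically, as described above
df1["NewAnswer"] = pd.factorize(df1["answer"])[0] + 1

# A timedelta axis is awkward to plot, so convert it to hours first
df1["interval_hours"] = df1["interval"].dt.total_seconds() / 3600

ax = sns.boxplot(x="answer", y="interval_hours", data=df1)
ax.set_ylabel("interval (hours)")
plt.show()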
I have a snippet of the dataframe here
TIME Value1 Value2
0 2014-10-02 12:45:03 5 6
1 2014-10-02 12:45:05 6 7
2 2014-10-02 12:45:08 3 5
3 2014-10-02 12:45:09 7 4
..................... ... ...
45 2014-11-03 00:51:09 7 8
Now, I would like to get all the dataframe rows between 2014-10-02 and 2014-11-02 without knowing the exact time to the second. I tried this method as shown below, but I am getting 0 rows.
start_date = pd.to_datetime('2014-10-02 00:00:01')
end_date = pd.to_datetime('2014-11-02 00:00:01')
df.loc[(df['TIME'] > start_date) & (df['TIME'] < end_date)]
But when I put in an exact datetime value like '2014-10-02 12:45:03' for the start date and end date, I get the output. The data has millions of rows and I could not possibly find out the exact time to the second for the start date and end date. I just need to get rows between two dates.
You can try boolean masking:
df.loc[(df['TIME'].dt.date > start_date.date()) & (df['TIME'].dt.date < end_date.date())]
OR
You can also use boolean masking with the between() method:
df[df['TIME'].dt.date.between(start_date.date(),end_date.date())]
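Note that the TIME column must be of datetime64 dtype for the .dt accessor to work; if it is not, convert it first (a minimal sketch, with the column and variable names from the question):
import pandas as pd

df['TIME'] = pd.to_datetime(df['TIME'])   # no-op if TIME is already datetime64[ns]

start_date = pd.to_datetime('2014-10-02')
end_date = pd.to_datetime('2014-11-02')

result = df[df['TIME'].dt.date.between(start_date.date(), end_date.date())]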
I am attempting to learn how to do in Pandas what I would otherwise do with SQL window functions.
Assume I have the following dataframe, which shows different players' previous matches and how many kills they got in each match.
date player kills
2019-01-01 a 15
2019-01-02 b 20
2019-01-03 a 10
2019-03-04 a 20
With the code below, I managed to create a groupby that shows only the previously summed kills (the sum of the player's kills, excluding the kills he got in the game of the current row).
df['sum_kills'] = df.groupby('player')['kills'].transform(lambda x: x.cumsum().shift())
This creates the following values:
date player kills sum_kills
2019-01-01 a 15 NaN
2019-01-02 b 20 NaN
2019-01-03 a 10 15
2019-03-04 a 20 25
However what I ideally want is the option to include a filter/where clause in the grouped values. So let's say I only wanted to get the summed values from the previous 30 days (1 month). Then my new dataframe should instead look like this:
date player kills sum_kills
2019-01-01 a 15 NaN
2019-01-02 b 20 NaN
2019-01-03 a 10 15
2019-03-04 a 20 NaN
The last row would provide zero summed_kills because no games from player a had been played over the last month. Is this possible somehow?
I think you are a bit in a pinch using groupby and transform. As explained here, transform operates on a single series, so you can't access data of other columns.
groupby and apply does not seem to be the correct way either, because the custom function is expected to return an aggregated result for the group passed by groupby, but you want a different result for each row.
So the best solution I can propose is to use apply without groupby, and perform all the selection yourself inside the custom function:
def killcount(x, data, timewin):
    """Count the player's kills in a time window before the time of the current row.
    x: dataframe row
    data: full dataframe
    timewin: a pandas.Timedelta
    """
    return data.loc[(data['date'] < x['date'])              # dates preceding the current row
                    & (data['date'] >= x['date'] - timewin) # dates within the time window
                    & (data['player'] == x['player'])       # rows with the same player
                   ]['kills'].sum()

df['sum_kills'] = df.apply(lambda r: killcount(r, df, pd.Timedelta(30, 'D')), axis=1)
This returns:
date player kills sum_kills
0 2019-01-01 a 15 0
1 2019-01-02 b 20 0
2 2019-01-03 a 10 15
3 2019-03-04 a 20 0
In case you haven't done so yet, remember to parse the 'date' column to datetime type using pandas.to_datetime, otherwise you cannot perform date comparisons.
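For reference, a small sketch that reproduces the example data from the question and applies killcount as defined above:
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2019-01-01', '2019-01-02', '2019-01-03', '2019-03-04']),
    'player': ['a', 'b', 'a', 'a'],
    'kills': [15, 20, 10, 20],
})

df['sum_kills'] = df.apply(lambda r: killcount(r, df, pd.Timedelta(30, 'D')), axis=1)
print(df)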
I am trying to make groups of x days within groups of another column. For some reason the grouping behavior is changed when I add another level of grouping.
See toy example below:
Create a random dataframe with 40 consecutive dates, an ID column and random values:
import numpy as np
import pandas as pd
df = pd.DataFrame(
{'dates':pd.date_range('2018-1-1',periods=40,freq='D'),
'id': np.concatenate((np.repeat(1,10),np.repeat(2,30))),
'amount':np.random.random(40)
}
)
I want to group by id first and then make groups of let's say 7 consecutive days within these groups. I do:
(df
.groupby(['id',pd.Grouper(key='dates',freq='7D')])
.amount
.agg(['mean','count'])
)
And the output is:
mean count
id dates
1 2018-01-01 0.591755 7
2018-01-08 0.701657 3
2 2018-01-08 0.235837 4
2018-01-15 0.650085 7
2018-01-22 0.463854 7
2018-01-29 0.643556 7
2018-02-05 0.459864 5
There is something weird going on in the second group! I would expect to see 4 groups of 7 and then a last group of 2. When I run the same code on a dataframe with only id=2, I do get what I actually expect:
df2=df[df.id==2]
(df2
.groupby(['id',pd.Grouper(key='dates',freq='7D')])
.amount
.agg(['mean','count'])
)
Output
mean count
id dates
2 2018-01-11 0.389343 7
2018-01-18 0.672550 7
2018-01-25 0.486620 7
2018-02-01 0.520816 7
2018-02-08 0.529915 2
What is going on here? Is it first creating a group of 4 in the id=2 group because the last group in id=1 group was only 3 rows? This is not what I want to do!
When you group on both id and the weekly Grouper, the 7-day bins are derived from the whole dates column rather than per id, so part of the first week of id=2 spills into a bin that starts before its first date (there are not enough days left in id=1's last week to fill a full 7 days). This is obvious when you look at the first bin label for id=2 in each case: "2018-01-08" with both keys vs. "2018-01-11" when only id=2 is present.
The workaround is to perform a groupby on id and then apply a resampling operation:
df.groupby('id').apply(
    lambda x: x.set_index('dates').amount.resample('7D').count()
)
id dates
1 2018-01-01 7
2018-01-08 3
2 2018-01-11 7
2018-01-18 7
2018-01-25 7
2018-02-01 7
2018-02-08 2
Name: amount, dtype: int64
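If both the mean and the count per 7-day bin are needed (as in the question's original aggregation), the same workaround extends to multiple aggregations (a sketch, reusing the df from the question):
out = (
    df.groupby('id')
      .apply(lambda x: x.set_index('dates')['amount'].resample('7D').agg(['mean', 'count']))
)
print(out)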