I am currently working with session logs and am interested in counting the number of occurrences of specific events in custom time frames (the first 1, 5, and 10 minutes of a session, and everything after 10 minutes). To simplify: the start of a session is defined as the time of the first occurrence of a relevant event.
I have already filtered the sessions down to only the relevant events, and the dataframe looks similar to this.
Input
import pandas as pd
data_in = {'SessionId': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
'Timestamp': ['2020-08-24 12:46:30.726000+00:00', '2020-08-24 12:46:38.726000+00:00', '2020-08-24 12:49:30.726000+00:00', '2020-08-24 12:50:49.726000+00:00', '2020-08-24 12:58:30.726000+00:00', '2021-02-12 16:12:12.726000+00:00', '2021-02-12 16:15:24.726000+00:00', '2021-02-12 16:31:07.726000+00:00', '2020-12-03 23:58:17.726000+00:00', '2020-12-04 00:03:44.726000+00:00'],
'event': ['match', 'match', 'match', 'match', 'match', 'match', 'match', 'match', 'match', 'match']
}
df_in = pd.DataFrame(data_in)
df_in
Desired Output:
data_out = {'SessionId': ['A', 'B', 'C'],
'#events_first_1_minute': [2, 1, 1],
'#events_first_5_minute': [4, 2, 1],
'#events_first_10_minute': [4, 2, 2],
'#events_after_10_minute': [5, 3, 2]
}
df_out = pd.DataFrame(data_out)
df_out
I already played around with groupby and pd.Grouper, roughly along the lines of the sketch below. I can get the total number of relevant events per session, but I don't see any option for custom, session-relative time bins. Another idea was to drop the date part and work only with the time of day, but of course there are sessions that start on one day and end on the next (SessionId C).
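Roughly what my attempt looked like (a sketch, just to illustrate the limitation I mean):
df_in['Timestamp'] = pd.to_datetime(df_in['Timestamp'])
# total number of relevant events per session works fine
df_in.groupby('SessionId')['event'].count()
# but pd.Grouper only gives fixed-frequency bins anchored to the clock,
# not "first 1 / 5 / 10 minutes of each session" bins
df_in.groupby(['SessionId', pd.Grouper(key='Timestamp', freq='5min')])['event'].count()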
Any help is appreciated!
Using pandas.cut:
df_in['Timestamp'] = pd.to_datetime(df_in['Timestamp'])
bins = ['1min', '5min', '10min']
bins2 = pd.to_timedelta(['0']+bins+['10000days'])
# offset of each event from its session start, binned into right-closed intervals;
# the first event (offset 0) falls outside the first bin, hence the fillna with the first label
group = pd.cut(df_in.groupby('SessionId')['Timestamp'].apply(lambda x: x - x.min()),
               bins=bins2, labels=bins + ['>' + bins[-1]]).fillna(bins[0])
(df_in
.groupby(['SessionId', group]).size()
.unstack(level=1)
.cumsum(axis=1)
)
output:
Timestamp 1min 5min 10min >10min
SessionId
A 2 4 4 5
B 1 2 2 3
C 1 1 2 2
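If the groupby/apply call above returns a MultiIndexed Series on your pandas version (newer releases keep the group key in the index of apply results), the same offset can be computed with transform, which always preserves the original index. A minimal sketch, reusing bins and bins2 from above:
# offset of every event from its session's first event, aligned with df_in's index
offset = df_in['Timestamp'] - df_in.groupby('SessionId')['Timestamp'].transform('min')
group = pd.cut(offset, bins=bins2, labels=bins + ['>' + bins[-1]]).fillna(bins[0])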
First convert your Timestamp column to datetime64 dtype, then group by SessionId and aggregate the data:
# Not mandatory if it's already the case
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
# assumes rows are already sorted by Timestamp within each session,
# so that x.iloc[0] is the session start
out = df.groupby('SessionId')['Timestamp'] \
        .agg(**{'<1m': lambda x: sum(x < x.iloc[0] + pd.DateOffset(minutes=1)),
                '<5m': lambda x: sum(x < x.iloc[0] + pd.DateOffset(minutes=5)),
                '<10m': lambda x: sum(x < x.iloc[0] + pd.DateOffset(minutes=10)),
                '>10m': 'size'}).reset_index()
Output:
>>> out
SessionId <1m <5m <10m >10m
0 A 2 4 4 5
1 B 1 2 2 3
2 C 1 1 2 2
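If the minute thresholds may change, the same named aggregation can be built programmatically. A sketch under the same assumption that rows are sorted by Timestamp within each session (minutes and aggs are names introduced here for illustration):
minutes = [1, 5, 10]
# the inner lambda captures m immediately to avoid late binding
aggs = {f'<{m}m': (lambda m: lambda x: (x < x.iloc[0] + pd.DateOffset(minutes=m)).sum())(m)
        for m in minutes}
aggs[f'>{minutes[-1]}m'] = 'size'
out = df.groupby('SessionId')['Timestamp'].agg(**aggs).reset_index()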
Use:
import numpy as np
#convert values to datetimes
df_in['Timestamp'] = pd.to_datetime(df_in['Timestamp'])
#get minutes by subtracting the minimal datetime per group
df_in['g'] = (df_in['Timestamp'].sub(df_in.groupby('SessionId')['Timestamp'].transform('min'))
                                .dt.total_seconds().div(60))
#binning to intervals
lab = ['events_first_1_minute','events_first_5_minute','events_first_10_minute',
'events_after_10_minute']
df_in['g'] = pd.cut(df_in['g'],
bins=[0, 1, 5, 10, np.inf],
labels=lab, include_lowest=True)
#count values with cumulative sums
df = (pd.crosstab(df_in['SessionId'], df_in['g'])
.cumsum(axis=1)
.rename(columns=str)
.reset_index()
.rename_axis(None, axis=1))
print (df)
SessionId events_first_1_minute events_first_5_minute \
0 A 2 4
1 B 1 2
2 C 1 1
events_first_10_minute events_after_10_minute
0 4 5
1 2 3
2 2 2
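If the column names from the desired output (with the leading #) are wanted, a final rename is enough; a small sketch reusing lab from above:
# optional: match the '#'-prefixed column names from the desired output
df = df.rename(columns={c: '#' + c for c in lab})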
I have a dataframe with words as the index and a corresponding sentiment score in another column. I also have a second dataframe with a column whose entries are lists of words (token lists), one list per row. I want to find the average sentiment score for each such list. This has to be done for a huge number of rows, so efficiency is important.
One method I have in mind is given below:
import pandas as pd
a = [['a', 'b', 'c'], ['hi', 'this', 'is', 'a', 'sample']]
df = pd.DataFrame()
df['tokens'] = a
'''
df
                      tokens
0                  [a, b, c]
1  [hi, this, is, a, sample]
'''
def find_score(tokenlist, ref_df):
    # ref_df contains two cols, 'tokens' and 'sentiment_score'
    temp_df = pd.DataFrame()
    temp_df['tokens'] = tokenlist
    return temp_df.merge(ref_df, on='tokens', how='inner')['sentiment_score'].mean()
    # this should return the score for one token list
df['score'] = df['tokens'].apply(find_score, args=(ref_df,))
# each input to find_score is a list
Is there any more efficient way to do it without creating dataframe for each list?
You can create a dictionary for mapping from the reference dataframe ref_df and then use .map() on each token list on each row of dataframe df, as follows:
import numpy as np
ref_dict = dict(zip(ref_df['tokens'], ref_df['sentiment_score']))
df['score'] = df['tokens'].map(lambda x: np.mean([ref_dict[y] for y in x if y in ref_dict]))
Demo
Test Data Construction
import numpy as np
import pandas as pd
a = [['a', 'b', 'c'], ['hi', 'this', 'is', 'a', 'sample']]
df = pd.DataFrame()
df['tokens'] = a
ref_df = pd.DataFrame({'tokens': ['a', 'b', 'c', 'd', 'hi', 'this', 'is', 'sample', 'example'],
'sentiment_score': [1, 2, 3, 4, 11, 12, 13, 14, 15]})
print(df)
tokens
0 [a, b, c]
1 [hi, this, is, a, sample]
print(ref_df)
tokens sentiment_score
0 a 1
1 b 2
2 c 3
3 d 4
4 hi 11
5 this 12
6 is 13
7 sample 14
8 example 15
Run New Code
ref_dict = dict(zip(ref_df['tokens'], ref_df['sentiment_score']))
df['score'] = df['tokens'].map(lambda x: np.mean([ref_dict[y] for y in x if y in ref_dict]))
Output
print(df)
tokens score
0 [a, b, c] 2.0
1 [hi, this, is, a, sample] 10.2
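If some token lists contain no known tokens at all, np.mean over an empty list emits a RuntimeWarning and returns nan. A small helper makes that case explicit (a sketch; list_score is a name introduced here, not from the answer above):
def list_score(tokens, ref_dict=ref_dict):
    # keep only tokens that actually have a sentiment score
    vals = [ref_dict[t] for t in tokens if t in ref_dict]
    return np.mean(vals) if vals else np.nan

df['score'] = df['tokens'].map(list_score)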
Let's try explode, merge, and agg:
import pandas as pd
a = [['a', 'b', 'c'], ['hi', 'this', 'is', 'a', 'sample']]
df = pd.DataFrame()
df['tokens'] = a
ref_df = pd.DataFrame({'sentiment_score': {'a': 1, 'b': 2,
'c': 3, 'hi': 4,
'this': 5, 'is': 6,
'sample': 7}})
# Explode Tokens into rows (Preserve original index)
new_df = df.explode('tokens').reset_index()
# Merge sentiment_scores
new_df = new_df.merge(ref_df, left_on='tokens',
right_index=True,
how='inner')
# Group By Original Index and agg back to lists and take mean
new_df = new_df.groupby('index') \
.agg({'tokens': list, 'sentiment_score': 'mean'}) \
.reset_index(drop=True)
print(new_df)
Output:
tokens sentiment_score
0 [a, b, c] 2.0
1 [a, hi, this, is, sample] 4.6
After Explode:
index tokens
0 0 a
1 0 b
2 0 c
3 1 hi
4 1 this
5 1 is
6 1 a
7 1 sample
After Merge
index tokens sentiment_score
0 0 a 1
1 1 a 1
2 0 b 2
3 0 c 3
4 1 hi 4
5 1 this 5
6 1 is 6
7 1 sample 7
(The one-liner)
new_df = df.explode('tokens') \
.reset_index() \
.merge(ref_df, left_on='tokens',
right_index=True,
how='inner') \
.groupby('index') \
.agg({'tokens': list, 'sentiment_score': 'mean'}) \
.reset_index(drop=True)
If the order of the tokens in the list matters, the scores can be calculated and merged back to the original df instead of using list aggregation:
mean_scores = df.explode('tokens') \
.reset_index() \
.merge(ref_df, left_on='tokens',
right_index=True,
how='inner') \
.groupby('index')[['sentiment_score']].mean() \
.reset_index(drop=True)
new_df = df.merge(mean_scores,
left_index=True,
right_index=True)
print(new_df)
Output:
tokens sentiment_score
0 [a, b, c] 2.0
1 [hi, this, is, a, sample] 4.6
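A possible variant (a sketch) if rows whose tokens have no match at all should be kept rather than silently dropped by the inner merge; how='left' also preserves the row order of the exploded frame, so the aggregated token lists keep their original order:
new_df = df.explode('tokens') \
           .reset_index() \
           .merge(ref_df, left_on='tokens',
                  right_index=True,
                  how='left') \
           .groupby('index') \
           .agg({'tokens': list, 'sentiment_score': 'mean'}) \
           .reset_index(drop=True)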
I read this excellent guide to pivoting but I can't work out how to apply it to my case. I have tidy data like this:
>>> import pandas as pd
>>> df = pd.DataFrame({
... 'case': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', ],
... 'perf_var': ['num', 'time', 'num', 'time', 'num', 'time', 'num', 'time'],
... 'perf_value': [1, 10, 2, 20, 1, 30, 2, 40]
... }
... )
>>>
>>> df
case perf_var perf_value
0 a num 1
1 a time 10
2 a num 2
3 a time 20
4 b num 1
5 b time 30
6 b num 2
7 b time 40
What I want is:
To use "case" as the columns
To use the "num" values as the index
To use the "time" values as the value.
to give:
case a b
1.0 10 30
2.0 20 40
All the pivot examples I can see have the index and values in separate columns, but the above seems like a valid/common "tidy" data case to me (I think?). Is it possible to pivot from this?
You need a bit of preprocessing to get your final result:
import numpy as np
(df.assign(num=np.where(df.perf_var == "num",
df.perf_value,
np.nan),
time=np.where(df.perf_var == "time",
df.perf_value,
np.nan))
.assign(num=lambda x: x.num.ffill(),
time=lambda x: x.time.bfill())
.loc[:, ["case", "num", "time"]]
.drop_duplicates()
.pivot(index="num", columns="case", values="time"))
case a b
num
1.0 10.0 30.0
2.0 20.0 40.0
An alternative route to the same end point:
(
df.set_index(["case", "perf_var"], append=True)
.unstack()
.droplevel(0, 1)
.assign(num=lambda x: x.num.ffill(),
time=lambda x: x.time.bfill())
.drop_duplicates()
.droplevel(0)
.set_index("num", append=True)["time"]
.unstack(0)
.rename_axis(index=None)
)
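Another way to look at it (a sketch): number the repeated (case, perf_var) observations with cumcount so that each num/time pair lands on one row, then pivot a second time:
# pair each 'num' with the matching 'time' by order of appearance within (case, perf_var)
pairs = (df.assign(obs=df.groupby(['case', 'perf_var']).cumcount())
           .pivot_table(index=['case', 'obs'], columns='perf_var', values='perf_value')
           .reset_index())
out = pairs.pivot(index='num', columns='case', values='time')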
I have df below:
df = pd.DataFrame({
'ID': ['a', 'a', 'a', 'b', 'c', 'c'],
'V1': [False, False, True, True, False, True],
'V2': ['A', 'B', 'C', 'B', 'B', 'C']
})
I want to achieve the following. For each unique ID, the last row has V1 == True. I want to count how many times each unique value of V2 occurs where V1 == True. This part can be achieved by something like:
df.groupby('V2').V1.sum()
However, I also want to add, for each unique value of V2, a column indicating how many times that value occurred after the point where V1 == True for the V2 value indicated by the row. I understand this might sound confusing; here is what the output would look like in this example:
df
V2 V1 A B C
0 A 0 0 0 0
1 B 1 0 0 0
2 C 2 1 2 0
It is important that the solution is general enough to be applicable on a similar case with more unique values than just A, B and C.
UPDATE
As a bonus, I am also interested in how one can return, instead of the count, the sum of some value column under the same conditions, divided by the corresponding "count" in the rows. Example: suppose we now start from the df below instead:
df = pd.DataFrame({
'ID': ['a', 'a', 'a', 'b', 'c', 'c'],
'V1': [False, False, True, True, False, True],
'V2': ['A', 'B', 'C', 'B', 'B', 'C'],
'V3': [1, 2, 3, 4, 5, 6],
})
The output would need to sum V3 for the cases indicated by the counts in the solution by #jezrael, and divide that number by V1. The output would instead look like:
df
V2 V1 A B C
0 A 0 0 0 0
1 B 1 0 0 0
2 C 2 1 3.5 0
First aggregate sum:
df1 = df.groupby('V2').V1.sum().astype(int).reset_index()
print (df1)
V2 V1
0 A 0
1 B 1
2 C 2
Then group by ID and create a helper column holding the last V2 value per group with GroupBy.transform('last'), remove the last row of each ID with Series.duplicated, use crosstab for the counts, reindex to add all possible unique values of V2, and finally append the result to df1 with DataFrame.join:
val = df['V2'].unique()
df['new'] = df.groupby('ID').V2.transform('last')
df = df[df.duplicated('ID', keep='last')]
df = pd.crosstab(df['new'], df['V2']).reindex(columns=val, index=val, fill_value=0)
df = df1.join(df, on='V2')
print (df)
V2 V1 A B C
0 A 0 0 0 0
1 B 1 0 0 0
2 C 2 1 2 0
UPDATE
The updated part of the question should be possible to achieve by replacing the crosstab part with a pivot table:
df = df.pivot_table(
         index='new',
         columns='V2',
         values='V3',
         aggfunc='mean'
     ).reindex(columns=val, index=val, fill_value=0)
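To complete the picture (a sketch reusing df1, val and the filtered frame from the count solution above), the pivoted means can be appended to df1 the same way the crosstab result was:
# append the per-V2 means to the V1 counts, as in the crosstab version
df = df1.join(df, on='V2')
print (df)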
I have a dataframe that looks like so:
df = pd.DataFrame([[0.012343, 'A'], [0.135528, 'A'], [0.198878, 'A'], [0.199999, 'B'], [0.181121, 'B'], [0.199999, 'B']])
df.columns = ['effect', 'category']
effect category
0 0.012343 A
1 0.135528 A
2 0.198878 A
3 0.199999 B
4 0.181121 B
5 0.199999 B
My goal is to get a representation of the frequency distribution of each category. In this case, the bin size would be 0.05. The resulting dataframe would look like the following:
my_distribution = pd.DataFrame([['A', 1, 0, 1, 1], ['B', 0, 0, 0, 3]])
my_distribution.columns = ['category', '0.0-0.05', '0.05-0.10', '0.1-0.15', '0.15-0.20']
category 0.0-0.05 0.05-0.10 0.1-0.15 0.15-0.20
0 A 1 0 1 1
1 B 0 0 0 3
So, in brief what I am trying to do is create bins and count the number of occurrences in each bin, separated by category. Any help would be really appreciated.
You can use cut followed by crosstab + reindex:
import pandas as pd
df = pd.DataFrame([[0.01, 'A'], [0.13, 'A'], [0.19, 'A'], [0.19, 'B'], [0.18, 'B'], [0.19, 'B']])
df.columns = ['effect', 'category']
labels = ['0.0-0.05', '0.05-0.10', '0.1-0.15', '0.15-0.20']
cuts = df.assign(quant=pd.cut(df.effect, bins=[0.0, 0.05, 0.10, 0.15, 0.20], labels=labels))
# get counts per bin
result = pd.crosstab(cuts.category, columns=cuts.quant)
# reindex with labels to account for bin with 0 counts
result = result.reindex(labels, axis=1).fillna(0).astype(int)
# reset index and rename axis for display purposes
result = result.reset_index().rename_axis(None, axis=1)
print(result)
Output
category 0.0-0.05 0.05-0.10 0.1-0.15 0.15-0.20
0 A 1 0 1 1
1 B 0 0 0 3
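If the bin edges change, the labels can be generated from the edges instead of being typed by hand. A small sketch (note the string formatting differs slightly from the hand-written labels, e.g. '0.05-0.1' instead of '0.05-0.10'):
edges = [0.0, 0.05, 0.10, 0.15, 0.20]
labels = [f'{lo}-{hi}' for lo, hi in zip(edges[:-1], edges[1:])]
cuts = df.assign(quant=pd.cut(df.effect, bins=edges, labels=labels))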
I usually use value_counts() to get the number of occurrences of a value. However, I now deal with large database tables (which cannot be loaded fully into RAM) and query the data in one-month chunks.
Is there a way to store the result of value_counts() and merge it with / add it to the next results?
I want to count the number of user actions. Assume the following structure of
user-activity logs:
# month 1
id userId actionType
1 1 a
2 1 c
3 2 a
4 3 a
5 3 b
# month 2
id userId actionType
6 1 b
7 1 b
8 2 a
9 3 c
Using value_counts() on those produces:
# month 1
userId
1 2
2 1
3 2
# month 2
userId
1 2
2 1
3 1
Expected output:
# month 1+2
userId
1 4
2 2
3 3
Up until now, I have only found a method using groupby and sum:
# count user actions and remember them in a new column
df1['count'] = df1.groupby(['userId'], sort=False)['id'].transform('count')
# drop unnecessary columns
df1 = df1[['userId', 'count']]
# drop unnecessary rows
df1 = df1.drop_duplicates(subset=['userId'])
# repeat
df2['count'] = df2.groupby(['userId'], sort=False)['id'].transform('count')
df2 = df2[['userId', 'count']]
df2 = df2.drop_duplicates(subset=['userId'])
# merge and sum up
print(pd.concat([df1, df2]).groupby(['userId'], sort=False).sum())
What is the pythonic / pandas' way of merging the information of several series' (and dataframes) efficiently?
Let me suggest add with a fill value of 0. This has an advantage over the previously suggested answer in that it will work when the two DataFrames have non-identical sets of unique keys.
# Create frames
df1 = pd.DataFrame(
{'User_id': ['a', 'a', 'b', 'c', 'c', 'd'], 'a': [1, 1, 2, 3, 3, 5]})
df2 = pd.DataFrame(
{'User_id': ['a', 'a', 'b', 'b', 'c', 'c', 'c'], 'a': [1, 1, 2, 2, 3, 3, 4]})
Now add the two sets of value_counts(). The fill_value argument handles the NaN values that would otherwise arise; in this example, the 'd' that appears in df1 but not in df2.
a = df1.User_id.value_counts()
b = df2.User_id.value_counts()
a.add(b, fill_value=0)
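Because the alignment introduces NaN before fill_value is applied, the result comes back as float64. If integer counts are wanted, a cast at the end restores them (a sketch):
combined = a.add(b, fill_value=0).astype(int)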
You can sum the series generated by the value_counts method directly:
# create frames
df = pd.DataFrame({'User_id': ['a', 'a', 'b', 'c', 'c'], 'a': [1, 1, 2, 3, 3]})
df1 = pd.DataFrame({'User_id': ['a', 'a', 'b', 'b', 'c', 'c', 'c'], 'a': [1, 1, 2, 2, 3, 3, 4]})
sum the series:
df.User_id.value_counts() + df1.User_id.value_counts()
output:
a 4
b 3
c 5
dtype: int64
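To illustrate the caveat from the previous answer with a tiny sketch (s1 and s2 are made-up counts): a key missing from one series becomes NaN with plain +, but is kept with .add(..., fill_value=0):
s1 = pd.Series({'a': 2, 'd': 1})
s2 = pd.Series({'a': 3})
s1 + s2                     # a 5.0, d NaN
s1.add(s2, fill_value=0)    # a 5.0, d 1.0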
This is known as "Split-Apply-Combine". It can be done in one line, using a lambda function, as follows.
1️⃣ paste this into your code:
df['total_for_this_label'] = df.groupby('label', as_index=False)['label'].transform(lambda x: x.count())
2️⃣ replace the three occurrences of label with the name of the column whose values you are counting (case-sensitive)
3️⃣ run print(df.head()) to check it has worked correctly