Pandas - How to group sequences - python

How do I group a pandas (or dask) dataframe and get the min, max and a sum, but only over consecutive rows whose difference is 1 second?
MY DATA:

ID  DT        VALOR
1   12:01:00  7
1   12:01:01  1
1   12:01:02  4
1   12:01:03  3
1   12:01:08  1
1   12:01:09  5
2   12:01:09  6
1   12:01:10  6
1   12:01:11  4
RETURN:

ID  MENOR_DT  MAIOR_DT  SOMA
1   12:01:00  12:01:03  15
1   12:01:08  12:01:11  16
2   12:01:09  12:01:09  6
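For reference, the sample frame can be rebuilt like this (a minimal sketch; DT is kept as plain time strings and the column names are taken from the question):

import pandas as pd

df = pd.DataFrame({
    "ID":    [1, 1, 1, 1, 1, 1, 2, 1, 1],
    "DT":    ["12:01:00", "12:01:01", "12:01:02", "12:01:03",
              "12:01:08", "12:01:09", "12:01:09", "12:01:10", "12:01:11"],
    "VALOR": [7, 1, 4, 3, 1, 5, 6, 6, 4],
})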

Try:
# convert DT to timedeltas so consecutive rows can be compared
df["DT"] = pd.to_timedelta(df["DT"])

# per ID, start a new sub-group whenever the gap to the previous row is not 1 second
tmp = df.groupby("ID", group_keys=False)["DT"].apply(
    lambda x: (x.diff().bfill() != pd.Timedelta("1 second")).cumsum()
)

df = (
    df.groupby(["ID", tmp])
    .agg(
        ID=("ID", "first"),
        MENOR_DT=("DT", "min"),
        MAIOR_DT=("DT", "max"),
        SOMA=("VALOR", "sum"),
    )
    .reset_index(drop=True)
)

# keep only the hh:mm:ss part of the timedelta representation
df["MENOR_DT"] = df["MENOR_DT"].astype(str).str.split().str[-1]
df["MAIOR_DT"] = df["MAIOR_DT"].astype(str).str.split().str[-1]
print(df)
Prints:
ID MENOR_DT MAIOR_DT SOMA
0 1 12:01:00 12:01:03 15
1 1 12:01:08 12:01:11 16
2 2 12:01:09 12:01:09 6

df['seq'] = np.nan  # create a temp column

# sort the DF, find the seconds difference, and update the seq column:
# a new group starts wherever the gap to the previous row is not 1 second,
# cumsum + ffill then give every row in a run the same group label
df['seq'] = (df.sort_values(['ID', 'DT'])
               .assign(seq=df['seq']
                       .mask(pd.to_timedelta(df['DT']).dt.total_seconds()
                             .diff().ne(1), 1))['seq']
               .cumsum()
               .ffill()
             )

# groupby ID, seq and take the aggregates, then drop the seq column
(df.groupby(['ID', 'seq']).agg(MENOR_DT=('DT', 'min'),
                               MAIOR_DT=('DT', 'max'),
                               SOMA=('VALOR', 'sum'))
   .reset_index()
   .drop(columns=['seq']))
ID MENOR_DT MAIOR_DT SOMA
0 1 12:01:00 12:01:03 15
1 1 12:01:08 12:01:11 16
2 2 12:01:09 12:01:09 6
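
The question also asks about dask. Neither answer above covers it, so here is a rough, hedged sketch of one way to run the same per-ID logic through dask's groupby().apply(); it assumes DT has already been converted with pd.to_timedelta (as in the first answer) and that all rows of a single ID fit comfortably in memory:

import dask.dataframe as dd
import pandas as pd

def collapse(group):
    # group is a plain pandas DataFrame holding every row of one ID
    group = group.sort_values("DT")
    # new sub-group whenever the gap to the previous row is not 1 second
    key = (group["DT"].diff() != pd.Timedelta("1 second")).cumsum()
    return (group.groupby(key)
                 .agg(MENOR_DT=("DT", "min"),
                      MAIOR_DT=("DT", "max"),
                      SOMA=("VALOR", "sum")))

ddf = dd.from_pandas(df, npartitions=2)
meta = pd.DataFrame({"MENOR_DT": pd.Series(dtype="timedelta64[ns]"),
                     "MAIOR_DT": pd.Series(dtype="timedelta64[ns]"),
                     "SOMA": pd.Series(dtype="int64")})
result = ddf.groupby("ID").apply(collapse, meta=meta).compute()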

Related

Aggregating the counts on a certain day of the week in python

I have this data frame:
id date count
1 8/31/22 1
1 9/1/22 2
1 9/2/22 8
1 9/3/22 0
1 9/4/22 3
1 9/5/22 5
1 9/6/22 1
1 9/7/22 6
1 9/8/22 5
1 9/9/22 7
1 9/10/22 1
2 8/31/22 0
2 9/1/22 2
2 9/2/22 0
2 9/3/22 5
2 9/4/22 1
2 9/5/22 6
2 9/6/22 1
2 9/7/22 1
2 9/8/22 2
2 9/9/22 2
2 9/10/22 0
I want to aggregate the count by id and date to get the sum of quantities. Details:
Date: all counts in a week should be aggregated on Saturday. A week starts on Sunday and ends on Saturday. The time period (the first day and the last day of counts) is fixed for all ids.
The desired output is given below:
id date count
1 9/3/22 11
1 9/10/22 28
2 9/3/22 7
2 9/10/22 13
I already have the following code for this and it does work, but it is not efficient: it takes a long time to run on a large database. I am looking for a much faster and more efficient way to get the output:
df['day_name'] = new_df['date'].dt.day_name()
df_week_count = pd.DataFrame(columns=['id', 'date', 'count'])
for id in ids:
    # make a dataframe for each id
    df_id = new_df.loc[new_df['id'] == id]
    df_id.reset_index(drop=True, inplace=True)
    # find Saturday indices
    saturday_indices = df_id.loc[df_id['day_name'] == 'Saturday'].index
    j = 0
    sat_index = 0
    while j < len(df_id):
        # find sum of count between j and saturday_indices[sat_index]
        sum_count = df_id.loc[j:saturday_indices[sat_index], 'count'].sum()
        # add id, date, sum_count to df_week_count
        temp_df = pd.DataFrame([[id, df_id.loc[saturday_indices[sat_index], 'date'], sum_count]],
                               columns=['id', 'date', 'count'])
        df_week_count = pd.concat([df_week_count, temp_df], ignore_index=True)
        j = saturday_indices[sat_index] + 1
        sat_index += 1
        if sat_index >= len(saturday_indices):
            break
    if j < len(df_id):
        sum_count = df_id.loc[j:, 'count'].sum()
        temp_df = pd.DataFrame([[id, df_id.loc[len(df_id) - 1, 'date'], sum_count]],
                               columns=['id', 'date', 'count'])
        df_week_count = pd.concat([df_week_count, temp_df], ignore_index=True)
df_final = df_week_count.copy(deep=True)
Create a grouping factor from the dates.
week = pd.to_datetime(df['date'].to_numpy()).strftime('%U %y')
df.groupby(['id',week]).agg({'date':max, 'count':sum}).reset_index()
id level_1 date count
0 1 35 22 9/3/22 11
1 1 36 22 9/9/22 28
2 2 35 22 9/3/22 7
3 2 36 22 9/9/22 13
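Note that date is still a string here, so the 'date': max aggregation is lexicographic, which is why week 36 reports 9/9/22 instead of 9/10/22. Converting the column first avoids that (a small sketch of the same idea):

df['date'] = pd.to_datetime(df['date'], format='%m/%d/%y')
week = df['date'].dt.strftime('%U %y')
df.groupby(['id', week]).agg({'date': 'max', 'count': 'sum'}).reset_index()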
I tried to understand as much as I can :)
Here is my process:
# reading data
df = pd.read_csv(StringIO(data), sep=' ')
# data type fix
df['date'] = pd.to_datetime(df['date'])
# initial grouping
df = df.groupby(['id', 'date'])['count'].sum().to_frame().reset_index()
df.sort_values(by=['date', 'id'], inplace=True)
df.reset_index(drop=True, inplace=True)
# getting name of the day
df['day_name'] = df.date.dt.day_name()
# getting week number
df['week'] = df.date.dt.isocalendar().week
# adjusting week number to make saturday the last day of the week
df.loc[df.day_name == 'Sunday','week'] = df.loc[df.day_name == 'Sunday', 'week'] + 1
What I think you are looking for:
df.groupby(['id','week']).agg(count=('count','sum'), date=('date','max')).reset_index()
   id  week  count                date
0   1    35     11 2022-09-03 00:00:00
1   1    36     28 2022-09-10 00:00:00
2   2    35      7 2022-09-03 00:00:00
3   2    36     13 2022-09-10 00:00:00
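
For completeness, pandas can also anchor weekly bins directly on Saturday with the W-SAT frequency, which matches the Sunday-to-Saturday requirement without any day-name bookkeeping (a sketch, assuming date is already a datetime column):

out = (df.groupby(['id', pd.Grouper(key='date', freq='W-SAT')])['count']
         .sum()
         .reset_index())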

Grouping the data then converting timestamp column to row

I have gone through Convert columns into rows with Pandas and Merge timestamp column in pandas. The goal is to first group the data by ID and then convert the start_time column into an entry in the process column.
Given
start_time time process ID
14:05 14:16 A 1
14:05 14:34 B 1
14:05 15:00 C 1
14:05 15:10 D 1
14:12 14:19 A 2
14:12 14:54 B 2
Goal
time process ID
14:05 start_time 1 (previously this was in a separate column)
14:16 A 1
14:34 B 1
15:00 C 1
15:10 D 1
14:12 start_time 2
14:19 A 2
14:54 B 2
df.groupby('ID').melt(df.columns.difference(['start_time']), value_name='time')
Note: the start_time value within each ID remains the same.
You can treat your data as 2 separate DataFrames and recombine them like so:
# Extract start_times and clean up to match column names
start_times = (
    df[['start_time', 'ID']]
    .drop_duplicates()
    .rename(columns={'start_time': 'time'})
    .assign(process='start_time')
)

# combine data vertically
out = (
    pd.concat([start_times, df.drop(columns='start_time')])
    .sort_values(['ID', 'time'])
    .reset_index(drop=True)
)
print(out)
time ID process
0 14:05 1 start_time
1 14:16 1 A
2 14:34 1 B
3 15:00 1 C
4 15:10 1 D
5 14:12 2 start_time
6 14:19 2 A
7 14:54 2 B
You could use:
cols = df.columns.difference(['start_time', 'process']).to_list()

# identify first row per group
mask = ~df.duplicated('ID')

# melt first row per group
df2 = (df[mask]
       .drop(columns=['process', 'time'])
       .melt(cols, var_name='process', value_name='time')
       )

# concatenate with original dataframe and reorder
out = (pd.concat([df2, df])
       .sort_values(by='ID', kind='stable')
       [['time', 'process'] + cols]
       # .reset_index(drop=True)  # optional
       )
output:
time process ID
0 14:05 start_time 1
0 14:16 A 1
1 14:34 B 1
2 15:00 C 1
3 15:10 D 1
1 14:12 start_time 2
4 14:19 A 2
5 14:54 B 2

How to vectorize pandas code where it depends on previous row?

I am trying to vectorize a code snippet in pandas:
I have a pandas dataframe generated like this:
     ids  ftest  vals
0  Q52EG      0     0
1  Q52EG      0     1
2  Q52EG      1     2
3  Q52EG      1     3
4  Q52EG      1     4
5  QQ8Q4      0     5
6  QQ8Q4      0     6
7  QQ8Q4      1     7
8  QQ8Q4      1     8
9  QVIPW      1     9
If any row for an id in the ids column has the value 1 in the ftest column, then all subsequent rows with the same id should be marked 1 in a has_hist column, regardless of the current row's ftest value, as shown in the dataframe below:
     ids  ftest  vals  has_hist
0  Q52EG      0     0         0
1  Q52EG      0     1         0
2  Q52EG      1     2         0
3  Q52EG      1     3         1
4  Q52EG      1     4         1
5  QQ8Q4      0     5         0
6  QQ8Q4      0     6         0
7  QQ8Q4      1     7         0
8  QQ8Q4      1     8         1
9  QVIPW      1     9         0
I am currently doing this with an iterative approach like this:
previous_present = {}
has_prv_history = []
for index, value in id_df.iterrows():
    my_id = value["ids"]
    ftest_mentioned = value["ftest"]
    previous_flag = 0
    if my_id in previous_present.keys():
        previous_flag = 1
    elif ftest_mentioned == 1:
        previous_present[my_id] = 1
    has_prv_history.append(previous_flag)
id_df["has_hist"] = has_prv_history
Can this code be vectorized without using apply?
Two key functions for this kind of task are shift and ffill, applied per group. For this specific question:
df2 = df.copy()  # work on a copy that will hold the new column
df2["has_hist"] = df.groupby("ids").ftest.shift().where(lambda s: s.eq(1))
df2["has_hist"] = df2.groupby("ids").has_hist.ffill().fillna(0).astype("int32")
Here is a variant with transform, which however is often slower than "pure" Pandas operations in my experience:
df2["has_hist"] = (
    df
    .groupby("ids")
    .ftest.transform(
        lambda s: (
            s
            .shift()
            .where(lambda t: t.eq(1))
            .ffill()
            .fillna(0)
            .astype("int32")
        )
    )
)
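
Another equivalent vectorized option (a sketch, not from the answers above) is to shift ftest within each group and take a cumulative maximum, which flags every row after the first 1:

df["has_hist"] = (
    df.groupby("ids")["ftest"]
      .shift(fill_value=0)          # previous row's ftest within each id
      .eq(1)                        # True where the previous row had ftest == 1
      .groupby(df["ids"]).cummax()  # stays True for all later rows of that id
      .astype(int)
)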

Create Pandas TimeSeries from Data, Period-range and aggregation function

Context
I'd like to create a time series (with pandas) that counts, for each considered date, the distinct id values whose start and end dates enclose that date.
For the sake of legibility, this is a simplified version of the problem.
Data
Let's define the Data this way:
df = pd.DataFrame({
    'customerId': [
        '1', '1', '1', '2', '2'
    ],
    'id': [
        '1', '2', '3', '1', '2'
    ],
    'startDate': [
        '2000-01', '2000-01', '2000-04', '2000-05', '2000-06',
    ],
    'endDate': [
        '2000-08', '2000-02', '2000-07', '2000-07', '2000-08',
    ],
})
And the period range this way:
period_range = pd.period_range(start='2000-01', end='2000-07', freq='M')
Objectives
For each customerId, there are several distinct id.
The final aim is to get, for each date of the period range and for each customerId, the count of distinct id whose startDate and endDate satisfy the function my_date_predicate.
Simplified definition of my_date_predicate:
unset_date = pd.to_datetime("1900-01")

def my_date_predicate(date, row):
    return row.startDate <= date and \
        (row.endDate == unset_date or row.endDate > date)
Expected result
I'd like a time series result like this:
date customerId customerCount
0 2000-01 1 2
1 2000-01 2 0
2 2000-02 1 1
3 2000-02 2 0
4 2000-03 1 1
5 2000-03 2 0
6 2000-04 1 2
7 2000-04 2 0
8 2000-05 1 2
9 2000-05 2 1
10 2000-06 1 2
11 2000-06 2 2
12 2000-07 1 1
13 2000-07 2 0
Question
How could I use pandas to get such result?
Here's a solution:
df.startDate = pd.to_datetime(df.startDate)
df.endDate = pd.to_datetime(df.endDate)
df["month"] = df.apply(lambda row: pd.date_range(row["startDate"], row["endDate"], freq="MS", closed = "left"), axis=1)
df = df.explode("month")
period_range = pd.period_range(start='2000-01', end='2000-07', freq='M')
t = pd.DataFrame(period_range.to_timestamp(), columns=["month"])
customers_df = pd.DataFrame(df.customerId.unique(), columns = ["customerId"])
t = pd.merge(t.assign(dummy=1), customers_df.assign(dummy=1), on = "dummy").drop("dummy", axis=1)
t = pd.merge(t, df, on = ["customerId", "month"], how = "left")
t.groupby(["month", "customerId"]).count()[["id"]].rename(columns={"id": "count"})
The result is:
count
month customerId
2000-01-01 1 2
2 0
2000-02-01 1 1
2 0
2000-03-01 1 1
2 0
2000-04-01 1 2
2 0
2000-05-01 1 2
2 1
2000-06-01 1 2
2 2
2000-07-01 1 1
2 1
Note:
For unset dates, replace the end date with the very last date you're interested in before you start the calculation.
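For example (a hedged sketch, assuming unset end dates are stored as the 1900-01 sentinel defined in the question):

unset_date = pd.to_datetime("1900-01")
last_date = period_range[-1].to_timestamp()          # 2000-07-01 here
df.loc[df["endDate"] == unset_date, "endDate"] = last_date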
You can do it with two pivot_table calls to get the count of id per customer (in columns) per start date and per end date (in the index). reindex each one with the period_range you are interested in, subtract the end pivot from the start pivot, use cumsum to get the cumulative count of id per customerId, and finally use stack and reset_index to bring it into the wanted shape.
# convert to period columns like period_range
df['startDate'] = pd.to_datetime(df['startDate']).dt.to_period('M')
df['endDate'] = pd.to_datetime(df['endDate']).dt.to_period('M')

# create the pivots
pvs = (df.pivot_table(index='startDate', columns='customerId', values='id',
                      aggfunc='count', fill_value=0)
         .reindex(period_range, fill_value=0)
       )
pve = (df.pivot_table(index='endDate', columns='customerId', values='id',
                      aggfunc='count', fill_value=0)
         .reindex(period_range, fill_value=0)
       )
print (pvs)
customerId 1 2
2000-01 2 0 #two id for customer 1 that start at this month
2000-02 0 0
2000-03 0 0
2000-04 1 0
2000-05 0 1 #one id for customer 2 that start at this month
2000-06 0 1
2000-07 0 0
Now you can subtract one from the other and use cumsum to get the wanted amount per date.
res = (pvs - pve).cumsum().stack().reset_index()
res.columns = ['date', 'customerId','customerCount']
print (res)
date customerId customerCount
0 2000-01 1 2
1 2000-01 2 0
2 2000-02 1 1
3 2000-02 2 0
4 2000-03 1 1
5 2000-03 2 0
6 2000-04 1 2
7 2000-04 2 0
8 2000-05 1 2
9 2000-05 2 1
10 2000-06 1 2
11 2000-06 2 2
12 2000-07 1 1
13 2000-07 2 1
Note: I'm not really sure how to handle unset_date, as I don't see what it is used for.

Ranking over multiple columns in pandas

I have this data frame:
dict_data = {'id': [1, 1, 1, 2, 2, 2, 2, 2],
             'datetime': np.array(['2016-01-03T16:05:52.000000000', '2016-01-03T16:05:52.000000000',
                                   '2016-01-03T16:05:52.000000000', '2016-01-27T15:45:20.000000000',
                                   '2016-01-27T15:45:20.000000000', '2016-11-27T15:08:04.000000000',
                                   '2016-11-27T15:08:04.000000000', '2016-11-27T15:08:04.000000000'],
                                  dtype='datetime64[ns]')}
df_data = pd.DataFrame(dict_data)
The data looks like this (screenshot omitted).
I want to rank over customer id and date. I used this code:
(df_data.assign(rn=df_data.sort_values(['datetime'], ascending=True)
                          .groupby(['datetime', 'id'])
                          .cumcount() + 1)
        .sort_values(['datetime', 'rn'])
)
I get a different rank by ID for each date (screenshot of the ranked table omitted).
What I would like to see is a rank per ID, where rows with the same datetime get the same rank.
Here is how you can rank by datetime and id:
##### RANK BY datetime and id #####
In[]: df_data.rank(axis =0,ascending = 1, method = 'dense')
Out[]:
datetime id
0 1 1
1 1 1
2 1 1
3 2 2
4 2 2
5 3 2
6 3 2
7 3 2
##### GROUPBY id AND USE APPLY TO GET VALUES FOR EACH GROUP #####
In[]: df_data.rank(axis =0,ascending = 1, method = 'dense').groupby('id').apply(lambda x: x)
Out[]:
datetime id
0 1 1
1 1 1
2 1 1
3 2 2
4 2 2
5 3 2
6 3 2
7 3 2
##### THEN RANK INSIDE EACH GROUP #####
In[]: df_data.assign(rank=df_data.rank(axis =0,ascending = 1, method = 'dense').groupby('id').apply(lambda x: x.rank(axis =0,ascending = 1, method = 'dense'))['datetime'])
Out[]:
datetime id rank
0 2016-01-03 16:05:52 1 1
1 2016-01-03 16:05:52 1 1
2 2016-01-03 16:05:52 1 1
3 2016-01-27 15:45:20 2 1
4 2016-01-27 15:45:20 2 1
5 2016-11-27 15:08:04 2 2
6 2016-11-27 15:08:04 2 2
7 2016-11-27 15:08:04 2 2
If you want to change the ranking method, you'll find more information in the pandas documentation on ranking.
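
A more direct way to reach the same result (a sketch, not from the answer above) is to dense-rank datetime within each id group:

df_data["rank"] = (df_data.groupby("id")["datetime"]
                          .rank(method="dense")
                          .astype(int))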
