I have gone through Convert columns into rows with Pandas and Merge timestamp column in pandas. The goal is to first group the data by ID and then turn the start_time column into an entry in the process column.
Given
start_time time process ID
14:05 14:16 A 1
14:05 14:34 B 1
14:05 15:00 C 1
14:05 15:10 D 1
14:12 14:19 A 2
14:12 14:54 B 2
Goal
time process ID
14:05 start_time 1 (previously it was in a separate column)
14:16 A 1
14:34 B 1
15:00 C 1
15:10 D 1
14:12 start_time 2
14:19 A 2
14:54 B 2
This is what I tried, but it fails because a DataFrameGroupBy object has no melt method:
df.groupby('ID').melt(df.columns.difference(['start_time']), value_name='time')
Note: the start_time value within each ID is always the same.
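For reproducibility, the sample data can be constructed like this (a sketch; dtypes are assumed, with the times kept as plain strings):
import pandas as pd

df = pd.DataFrame({
    'start_time': ['14:05', '14:05', '14:05', '14:05', '14:12', '14:12'],
    'time':       ['14:16', '14:34', '15:00', '15:10', '14:19', '14:54'],
    'process':    ['A', 'B', 'C', 'D', 'A', 'B'],
    'ID':         [1, 1, 1, 1, 2, 2],
})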
You can treat your data as 2 separate DataFrames and recombine them like so:
# Extract start_times and clean up to match column names
start_times = (
df[['start_time', 'ID']]
.drop_duplicates()
.rename(columns={'start_time': 'time'})
.assign(process='start_time')
)
# combine data vertically
out = (
pd.concat([start_times, df.drop(columns='start_time')])
.sort_values(['ID', 'time'])
.reset_index(drop=True)
)
print(out)
time ID process
0 14:05 1 start_time
1 14:16 1 A
2 14:34 1 B
3 15:00 1 C
4 15:10 1 D
5 14:12 2 start_time
6 14:19 2 A
7 14:54 2 B
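Note that sorting on ['ID', 'time'] works here because the times are zero-padded HH:MM strings within a single day; if that cannot be guaranteed, converting to a timedelta for the sort is safer. A sketch against the out frame above (the temporary _t column is made up):
# convert HH:MM strings to timedeltas only for a robust chronological sort
out = (out.assign(_t=pd.to_timedelta(out['time'] + ':00'))
          .sort_values(['ID', '_t'])
          .drop(columns='_t')
          .reset_index(drop=True))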
You could use:
cols = df.columns.difference(['start_time', 'process', 'time']).to_list()
# identify first row per group
mask = ~df.duplicated('ID')
# melt first row per group
df2 = (df[mask]
       .drop(columns=['process', 'time'])
       .melt(cols, var_name='process', value_name='time')
       )
# concatenate with original dataframe and reorder
out = (pd.concat([df2, df])
       .sort_values(by='ID', kind='stable')
       [['time', 'process'] + cols]
       # .reset_index(drop=True)  # optional
       )
output:
time process ID
0 14:05 start_time 1
0 14:16 A 1
1 14:34 B 1
2 15:00 C 1
3 15:10 D 1
1 14:12 start_time 2
4 14:19 A 2
5 14:54 B 2
Related
I have this data frame:
id date count
1 8/31/22 1
1 9/1/22 2
1 9/2/22 8
1 9/3/22 0
1 9/4/22 3
1 9/5/22 5
1 9/6/22 1
1 9/7/22 6
1 9/8/22 5
1 9/9/22 7
1 9/10/22 1
2 8/31/22 0
2 9/1/22 2
2 9/2/22 0
2 9/3/22 5
2 9/4/22 1
2 9/5/22 6
2 9/6/22 1
2 9/7/22 1
2 9/8/22 2
2 9/9/22 2
2 9/10/22 0
I want to aggregate the count by id and date to get the sum of quantities. Details:
Date: all counts in a week should be aggregated on Saturday. A week starts on Sunday and ends on Saturday. The time period (the first day and the last day of counts) is fixed for all of the ids.
The desired output is given below:
id date count
1 9/3/22 11
1 9/10/22 28
2 9/3/22 7
2 9/10/22 13
I already have the following code for this; it works, but it is not efficient and takes a long time to run on a large database. I am looking for a much faster and more efficient way to get the output:
df['day_name'] = new_df['date'].dt.day_name()
df_week_count = pd.DataFrame(columns=['id', 'date', 'count'])
for id in ids:
    # make a dataframe for each id
    df_id = new_df.loc[new_df['id'] == id]
    df_id.reset_index(drop=True, inplace=True)
    # find Saturday indices
    saturday_indices = df_id.loc[df_id['day_name'] == 'Saturday'].index
    j = 0
    sat_index = 0
    while j < len(df_id):
        # find sum of count between j and saturday_indices[sat_index]
        sum_count = df_id.loc[j:saturday_indices[sat_index], 'count'].sum()
        # add id, date, sum_count to df_week_count
        temp_df = pd.DataFrame([[id, df_id.loc[saturday_indices[sat_index], 'date'], sum_count]],
                               columns=['id', 'date', 'count'])
        df_week_count = pd.concat([df_week_count, temp_df], ignore_index=True)
        j = saturday_indices[sat_index] + 1
        sat_index += 1
        if sat_index >= len(saturday_indices):
            break
    if j < len(df_id):
        sum_count = df_id.loc[j:, 'count'].sum()
        temp_df = pd.DataFrame([[id, df_id.loc[len(df_id) - 1, 'date'], sum_count]],
                               columns=['id', 'date', 'count'])
        df_week_count = pd.concat([df_week_count, temp_df], ignore_index=True)
df_final = df_week_count.copy(deep=True)
Create a grouping factor from the dates.
week = pd.to_datetime(df['date'].to_numpy()).strftime('%U %y')
df.groupby(['id',week]).agg({'date':max, 'count':sum}).reset_index()
   id level_1    date  count
0   1   35 22  9/3/22     11
1   1   36 22  9/9/22     28
2   2   35 22  9/3/22      7
3   2   36 22  9/9/22     13
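If the date column is parsed to real datetimes first, a Saturday-anchored weekly Grouper expresses the Sunday-to-Saturday week directly and labels every group with its Saturday. This is a sketch, not part of the answer above, and it assumes the dates are in month/day/year format:
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%y')
# 'W-SAT' bins run Sunday through Saturday and are labelled by their Saturday
out = (df.groupby(['id', pd.Grouper(key='date', freq='W-SAT')])['count']
         .sum()
         .reset_index())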
I tried to understand as much as I can :)
Here is my process:
# reading data
df = pd.read_csv(StringIO(data), sep=' ')
# data type fix
df['date'] = pd.to_datetime(df['date'])
# initial grouping
df = df.groupby(['id', 'date'])['count'].sum().to_frame().reset_index()
df.sort_values(by=['date', 'id'], inplace=True)
df.reset_index(drop=True, inplace=True)
# getting name of the day
df['day_name'] = df.date.dt.day_name()
# getting week number
df['week'] = df.date.dt.isocalendar().week
# adjusting week number to make saturday the last day of the week
df.loc[df.day_name == 'Sunday','week'] = df.loc[df.day_name == 'Sunday', 'week'] + 1
What I think you are looking for:
df.groupby(['id','week']).agg(count=('count','sum'), date=('date','max')).reset_index()
   id  week  count                 date
0   1    35     11  2022-09-03 00:00:00
1   1    36     28  2022-09-10 00:00:00
2   2    35      7  2022-09-03 00:00:00
3   2    36     13  2022-09-10 00:00:00
How can I group a pandas (or dask) dataframe and get the min, max, and a sum, but only when the difference between consecutive rows in the group is 1 second?
MY DATA:
ID        DT  VALOR
 1  12:01:00      7
 1  12:01:01      1
 1  12:01:02      4
 1  12:01:03      3
 1  12:01:08      1
 1  12:01:09      5
 2  12:01:09      6
 1  12:01:10      6
 1  12:01:11      4
RETURN:
ID  MENOR_DT  MAIOR_DT  SOMA
 1  12:01:00  12:01:03    15
 1  12:01:08  12:01:11    16
 2  12:01:09  12:01:09     6
Try:
df["DT"] = pd.to_timedelta(df["DT"])
tmp = df.groupby("ID", group_keys=False)["DT"].apply(
lambda x: (x.diff().bfill() != "1 second").cumsum()
)
df = (
df.groupby(["ID", tmp])
.agg(
ID=("ID", "first"),
MENOR_DT=("DT", "min"),
MAIOR_DT=("DT", "max"),
SOME=("VALOR", "sum"),
)
.reset_index(drop=True)
)
df["MENOR_DT"] = df["MENOR_DT"].astype(str).str.split().str[-1]
df["MAIOR_DT"] = df["MAIOR_DT"].astype(str).str.split().str[-1]
print(df)
Prints:
ID MENOR_DT MAIOR_DT SOMA
0 1 12:01:00 12:01:03 15
1 1 12:01:08 12:01:11 16
2 2 12:01:09 12:01:09 6
df['seq'] = np.nan  # create a temp column
# sort the DF, find the seconds difference, and update the seq column
# ffill to group all rows that have a 1 second or less of difference
df['seq'] = (df.sort_values(['ID', 'DT'])
             .assign(seq=df['seq']
                     .mask(pd.to_timedelta(df['DT']).dt.total_seconds()
                           .diff().ne(1), 1))['seq']
             .cumsum()
             .ffill()
             )
# groupby ID, seq and take the aggregate
# drop the seq column
(df.groupby(['ID', 'seq']).agg(MENOR_DT=('DT', 'min'),
                               MAIOR_DT=('DT', 'max'),
                               SOMA=('VALOR', 'sum'))
 .reset_index()
 .drop(columns=['seq']))
ID MENOR_DT MAIOR_DT SOMA
0 1 12:01:00 12:01:03 15
1 1 12:01:08 12:01:11 16
2 2 12:01:09 12:01:09 6
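Both answers rely on the same idiom: a boolean "new run starts here" flag turned into run labels with cumsum. A tiny self-contained illustration (not taken from either answer):
import pandas as pd

s = pd.Series(pd.to_timedelta(['12:01:00', '12:01:01', '12:01:05', '12:01:06']))
# True whenever the gap to the previous timestamp is not exactly one second
new_run = s.diff().ne(pd.Timedelta(seconds=1))
run_id = new_run.cumsum()   # 1, 1, 2, 2 -> two consecutive runs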
I have 2 dataframes
Dataframe1:
id date1
1 11-04-2022
1 03-02-2011
2 03-05-2222
3 01-01-2001
4 02-02-2012
and Dataframe2:
id date2 data data2
1 11-02-2222 1 3
1 11-02-1999 3 4
1 11-03-2022 4 5
2 22-03-4444 5 6
2 22-02-2020 7 8
...
What I would like to do is take, for each row of Dataframe1, the row from Dataframe2 with the closest date2 to date1. It has to match on id, and date2 has to be before date1.
The desired output would look like this:
id date1 date2 data data2
1 11-04-2022 11-03-2022 4 5
1 03-02-2011 11-02-1999 3 4
2 03-05-2222 22-02-2020 7 8
How would I do this using pandas?
Try pd.merge_asof, but first convert date1 and date2 to datetime and sort both dataframes:
df1["date1"] = pd.to_datetime(df1["date1"])
df2["date2"] = pd.to_datetime(df2["date2"])
df1 = df1.sort_values(by="date1")
df2 = df2.sort_values(by="date2")
print(
    pd.merge_asof(
        df1,
        df2,
        by="id",
        left_on="date1",
        right_on="date2",
        direction="backward",  # only match date2 values on or before date1
    ).dropna(subset=["date2"])
)
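With direction="backward" (which is also merge_asof's default), each df1 row is matched to the last df2 row sharing its id whose date2 falls on or before date1; rows with no earlier match get NaN in date2 and are removed by the final dropna.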
Context
I'd like to create a time series (with pandas) that counts the distinct values of an id whose start and end dates bracket each date under consideration.
For the sake of legibility, this is a simplified version of the problem.
Data
Let's define the Data this way:
df = pd.DataFrame({
    'customerId': [
        '1', '1', '1', '2', '2'
    ],
    'id': [
        '1', '2', '3', '1', '2'
    ],
    'startDate': [
        '2000-01', '2000-01', '2000-04', '2000-05', '2000-06',
    ],
    'endDate': [
        '2000-08', '2000-02', '2000-07', '2000-07', '2000-08',
    ],
})
And the period range this way:
period_range = pd.period_range(start='2000-01', end='2000-07', freq='M')
Objectives
For each customerId, there are several distinct id values.
The final aim is to get, for each date of the period range and for each customerId, the count of distinct id values whose startDate and endDate match the function my_date_predicate.
Simplified definition of my_date_predicate:
unset_date = pd.to_datetime("1900-01")

def my_date_predicate(date, row):
    return row.startDate <= date and \
        (row.endDate == unset_date or row.endDate > date)
Expected result
I'd like a time series result like this:
date customerId customerCount
0 2000-01 1 2
1 2000-01 2 0
2 2000-02 1 1
3 2000-02 2 0
4 2000-03 1 1
5 2000-03 2 0
6 2000-04 1 2
7 2000-04 2 0
8 2000-05 1 2
9 2000-05 2 1
10 2000-06 1 2
11 2000-06 2 2
12 2000-07 1 1
13 2000-07 2 0
Question
How could I use pandas to get such result?
Here's a solution:
df.startDate = pd.to_datetime(df.startDate)
df.endDate = pd.to_datetime(df.endDate)
df["month"] = df.apply(lambda row: pd.date_range(row["startDate"], row["endDate"], freq="MS", closed = "left"), axis=1)
df = df.explode("month")
period_range = pd.period_range(start='2000-01', end='2000-07', freq='M')
t = pd.DataFrame(period_range.to_timestamp(), columns=["month"])
customers_df = pd.DataFrame(df.customerId.unique(), columns = ["customerId"])
t = pd.merge(t.assign(dummy=1), customers_df.assign(dummy=1), on = "dummy").drop("dummy", axis=1)
t = pd.merge(t, df, on = ["customerId", "month"], how = "left")
t.groupby(["month", "customerId"]).count()[["id"]].rename(columns={"id": "count"})
The result is:
                       count
month      customerId
2000-01-01 1               2
           2               0
2000-02-01 1               1
           2               0
2000-03-01 1               1
           2               0
2000-04-01 1               2
           2               0
2000-05-01 1               2
           2               1
2000-06-01 1               2
           2               2
2000-07-01 1               1
           2               1
Note:
For unset dates, replace the end date with the very last date you're interested in before you start the calculation.
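Concretely, that replacement could look like this (a sketch, reusing the unset_date sentinel defined in the question and assuming 2000-07 is the last month of interest):
# map the 1900-01 "unset" sentinel to the final month before building the per-row date ranges
df['endDate'] = df['endDate'].mask(df['endDate'].eq(unset_date), pd.Timestamp('2000-07-01'))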
You can do it with two pivot_table calls to get the count of id per customer (in columns) per start date (and end date) in the index. Reindex each one with the period_range you are interested in. Subtract the pivot for the end dates from the pivot for the start dates. Use cumsum to get the cumulative sum of id per customerId. Finally, use stack and reset_index to bring it to the wanted shape.
# convert to period columns like period_range
df['startDate'] = pd.to_datetime(df['startDate']).dt.to_period('M')
df['endDate'] = pd.to_datetime(df['endDate']).dt.to_period('M')
# create the pivots
pvs = (df.pivot_table(index='startDate', columns='customerId', values='id',
                      aggfunc='count', fill_value=0)
       .reindex(period_range, fill_value=0)
       )
pve = (df.pivot_table(index='endDate', columns='customerId', values='id',
                      aggfunc='count', fill_value=0)
       .reindex(period_range, fill_value=0)
       )
print (pvs)
customerId  1  2
2000-01     2  0   # two ids for customer 1 start in this month
2000-02     0  0
2000-03     0  0
2000-04     1  0
2000-05     0  1   # one id for customer 2 starts in this month
2000-06     0  1
2000-07     0  0
Now you can subtract one from the other and use cumsum to get the wanted amount per date.
res = (pvs - pve).cumsum().stack().reset_index()
res.columns = ['date', 'customerId','customerCount']
print (res)
date customerId customerCount
0 2000-01 1 2
1 2000-01 2 0
2 2000-02 1 1
3 2000-02 2 0
4 2000-03 1 1
5 2000-03 2 0
6 2000-04 1 2
7 2000-04 2 0
8 2000-05 1 2
9 2000-05 2 1
10 2000-06 1 2
11 2000-06 2 2
12 2000-07 1 1
13 2000-07 2 1
Not really sure how to handle the unset_date as I don't see what it is used for.
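If the sentinel endDate does occur in the data, one option with this approach is to leave those rows out of the end-date pivot, so that open-ended ids are never subtracted. A sketch (assuming endDate was already converted to monthly periods as above, and that pvs is built first):
unset_period = pd.Period('1900-01', freq='M')   # the question's unset_date as a monthly period
pve = (df.loc[df['endDate'].ne(unset_period)]
         .pivot_table(index='endDate', columns='customerId', values='id',
                      aggfunc='count', fill_value=0)
         .reindex(index=period_range, columns=pvs.columns, fill_value=0)
         )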
I have a relatively big data frame (~10 million rows). It has an id and a DateTimeIndex. For each row, I have to count the number of entries with the same id within a period of time (last week/month/year). I've created my own function using relativedelta, storing dates in a separate dictionary {id: [dates]}, but it is extremely slow. How should I do it fast and properly?
P.S.: I've heard about pandas.rolling() but I can't figure out how to use it correctly.
P.P.S.: my function:
def isinrange(date, listdate, delta):
    date, listdate = datetime.datetime.strptime(date, format), datetime.datetime.strptime(listdate, format)
    return date - delta <= listdate
Main code (it contains tons of unnecessary operations):
dictionary = dict()  # structure {id: [dates]}
for row in df.itertuples():  # filling a dictionary
    if row.id in dictionary:
        dictionary[row.id].append(row.DateTimeIndex)
    else:
        dictionary[row.id] = [row.DateTimeIndex, ]
week, month, year = relativedelta(days=7), relativedelta(months=1), relativedelta(years=1)  # relativedelta init
for row, i in zip(df.itertuples(), range(df.shape[0])):  # iterating over dataframe
    cnt1 = cnt2 = cnt3 = 0  # weekly, monthly, yearly - for each row
    for date in dictionary[row.id]:  # for each date with an id from row
        index_date = row.DateTimeIndex
        if date <= index_date:  # if date from dictionary is lesser than from a row
            if isinrange(index_date, date, year):
                cnt1 += 1
            if isinrange(index_date, date, month):
                cnt2 += 1
            if isinrange(index_date, date, week):
                cnt3 += 1
    df.loc[[i, 36], 'Weekly'] = cnt1  # add values to a data frame
    df.loc[[i, 37], 'Monthly'] = cnt2
    df.loc[[i, 38], 'Yearly'] = cnt3
Sample:
id date
1 2015-05-19
1 2015-05-22
2 2018-02-21
2 2018-02-23
2 2018-02-27
Expected result:
id date last_week
1 2015-05-19 0
1 2015-05-22 1
2 2018-02-21 0
2 2018-02-23 1
2 2018-02-27 2
year_range = ["2018"]
month_range = ["06"]
day_range = [str(x) for x in range(18, 25)]
date_range = [year_range, month_range, day_range]
# df = your dataframe
your_result = (df[df.date.apply(lambda x: sum([x.split("-")[i] in date_range[i] for i in range(3)]) == 3)]
               .groupby("id")
               .size()
               .reset_index(name="counts"))
print(your_result[:5])
I'm not sure I understood correctly, but is this the kind of thing you're looking for?
It took ~15s with a 10-million-row "test" dataframe:
id counts
0 0 454063
1 1 454956
2 2 454746
3 3 455317
4 4 454312
Wall time: 14.5 s
The "test" dataframe:
id date
0 4 2018-06-06
1 2 2018-06-18
2 4 2018-06-06
3 3 2018-06-18
4 5 2018-06-06
import pandas as pd
src = "path/data.csv"
df = pd.read_csv(src, sep=",")
print(df)
# id date
# 0 1 2015-05-19
# 1 1 2015-05-22
# 2 2 2018-02-21
# 3 2 2018-02-23
# 4 2 2018-02-27
# Convert date column to a datetime
df['date'] = pd.to_datetime(df['date'])
# Retrieve rows in the date range
date_ini = '2015-05-18'
date_end = '2016-05-18'
filtered_rows = df.loc[(df['date'] > date_ini) & (df['date'] <= date_end)]
print(filtered_rows)
# id date
# 0 1 2015-05-19
# 1 1 2015-05-22
# Group rows by id
grouped_by_id = filtered_rows.groupby(['id']).agg(['count'])
print(grouped_by_id)
# count
# id
# 1 2
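For the last_week column in the expected output above (and the pandas.rolling() idea mentioned in the question), a time-offset rolling count per id is one option. This is a sketch against the small sample; the helper column of ones is made up:
import pandas as pd

df = pd.DataFrame({
    'id':   [1, 1, 2, 2, 2],
    'date': ['2015-05-19', '2015-05-22', '2018-02-21', '2018-02-23', '2018-02-27'],
})
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(['id', 'date']).reset_index(drop=True)

# a trailing 7-day window includes the current row, so subtract 1 to count only earlier rows
df['helper'] = 1
counts = (df.set_index('date')
            .groupby('id')['helper']
            .rolling('7D')
            .count()
            .reset_index(drop=True))
df['last_week'] = (counts - 1).astype(int)   # should reproduce the last_week column above
df = df.drop(columns='helper')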