I have a pandas dataframe like this:
start end value course
0 2022-01-01 2022-01-01 10 first
1 2022-01-02 2022-01-02 20 first
2 2022-01-05 2022-01-05 30 second
3 2022-01-04 2022-01-04 40 second
4 2022-01-08 2022-01-08 21 first
5 2022-01-09 2022-01-09 92 first
6 2022-01-10 2022-01-10 55 first
What's the best way to group it like this:
start end value course
0 2022-01-01 2022-01-02 10 first
1 2022-01-04 2022-01-05 30 second
2 2022-01-08 2022-01-10 21 first
There might be more rows with a particular course, but the idea is how to group first by course and then by one continuous date range. Or maybe it's worth trying to split by missing dates? The closest case is this one; however, it didn't help, since I don't have info about the frequency of dates to pass into pd.Grouper(), and I also need to keep the start column.
You can create a virtual subgroup:
# Convert to datetime if necessary
# df['start'] = pd.to_datetime(df['start'])
# df['end'] = pd.to_datetime(df['end'])
is_consecutive = lambda x: x['start'].sub(x['end'].shift()).ne('1D')
df['group'] = (df.sort_values(['start', 'end'])
.groupby('course').apply(is_consecutive)
.cumsum().droplevel('course'))
print(df)
# Output
start end value course group
0 2022-01-01 2022-01-01 10 first 1
1 2022-01-02 2022-01-02 20 first 1
2 2022-01-05 2022-01-05 30 second 3
3 2022-01-04 2022-01-04 40 second 3
4 2022-01-08 2022-01-08 21 first 2
5 2022-01-09 2022-01-09 92 first 2
6 2022-01-10 2022-01-10 55 first 2
Now you can do what you want with these groups.
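For instance, if the goal is the exact output shown in the question, one possible aggregation over these subgroups could look like the following sketch (the choice of aggregation functions here is an assumption based on the desired output):
# Sketch: collapse each (course, group) pair into a single date range
out = (df.groupby(['course', 'group'], as_index=False)
         .agg(start=('start', 'min'), end=('end', 'max'), value=('value', 'first'))
         .drop(columns='group')
         .sort_values('start', ignore_index=True)
         [['start', 'end', 'value', 'course']])
print(out)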
A possible solution:
df["start"] = pd.to_datetime(df["start"])
df["end"] = pd.to_datetime(df["end"])
aggs = {"start": "first", "end": "last", "value": "first"}
out = (
    df
    .sort_values(by=["course", "start"])
    .assign(next_=lambda x: x.groupby("course")["start"].shift(-1),
            group=lambda x: ((x["next_"] != x["end"] + pd.Timedelta(days=1))
                             | (x["next_"].isna())).cumsum())
    .groupby(["course", "group"], as_index=False).agg(aggs)
    .sort_values(by="start", ignore_index=True).drop(columns="group")
    [["start", "end", "value", "course"]]
)
Output:
print(out)
start end value course
0 2022-01-01 2022-01-02 10 first
1 2022-01-04 2022-01-05 30 second
2 2022-01-08 2022-01-10 21 first
A cordial greeting, I hope my answer is of help to you.
The problem is that with the current dataframe columns there is no direct way to group the information as you want; however, it is possible to work around this.
What you need is to create an ID that identifies each group, and with that ID you can aggregate the columns for each group.
This idea is only valid under the assumption that rows of the same group are contiguous and that the order of the dates within a group is not important. In your example, the rows with indices 2 and 3 are not ordered by the 'start' column, unlike rows 0 and 1 or rows 4, 5 and 6.
The following line of code creates a unique ID for each group: each row's course is compared with the one above it, returning True when they are equal and False when they differ. Taking the cumulative sum of the negated boolean series gives each group a unique ID that auto-increments whenever the course changes.
df['group_id'] = (~df.course.eq(df.course.shift(1))).cumsum()
With this solved, you can aggregate the groups normally with the following code.
df.groupby(by=['group_id']).agg(
    start=('start', 'min'),
    end=('end', 'max'),
    value=('value', 'min'),
    course=('course', 'first')
).reset_index().drop(['group_id'], axis=1)
Greetings.
What I have and am trying to do:
A dataframe, with headers: event_id, location_id, start_date, end_date.
An event can only have one location, start and end.
A location can have multiple events, starts and ends, and they can overlap.
The goal here is to be able to say, given any time T, for location X, how many events were there?
E.g.
Given three events, all for location 2:
Event.   Start.      End.
Event 1  2022-05-01  2022-05-07
Event 2  2022-05-04  2022-05-10
Event 3  2022-05-02  2022-05-05

Time T.     Count of Events
2022-05-01  1
2022-05-02  2
2022-05-03  2
2022-05-04  3
2022-05-05  3
2022-05-06  2
What I have tried so far, but got stuck on:
(I did look at THIS possible solution for a similar problem, and I went pretty far with it, but I got lost in the iterrows and how to make that apply here.)
Try to get an array or dataframe that has a 365 day date range for each location ID.
E.g.
[1,2022-01-01],[1,2022-01-02]........[98,2022-01-01][98,2022-01-02]
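For reference, a rough sketch of building that location x date cross product (the column names and locations list here are made up for illustration):
import pandas as pd

locations = [1, 2, 98]                            # placeholder location IDs
dates = pd.date_range('2022-01-01', periods=365)  # the 365-day range
# one row per (location, date) pair
loc_dates = (pd.MultiIndex.from_product([locations, dates], names=['location', 'time'])
               .to_frame(index=False))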
Then convert that array to a dataframe, and merge it with the original dataframe like:
index  location  time        event  location2  start       end
0      1         2022-01-01  1      10         2022-11-07  2022-11-12
1      1         2022-01-01  2      4          2022-02-16  2022-03-05
2      1         2022-01-01  3      99         2022-06-10  2022-06-15
3      1         2022-01-01  4      2          2021-12-31  2022-01-05
4      1         2022-01-01  5      5          2022-05-08  2022-05-22
Then perform some kind of reduction that returns the count:
location  Time        Count
1         2022-01-01  10
1         2022-01-02  3
1         2022-01-03  13
...       ...         ...
99        2022-01-01  4
99        2022-01-02  0
99        2022-01-03  7
99        2022-01-04  12
I've done something similar, tying events to other events where their dates overlapped, using .loc[...], but I don't think that would work here, and I'm kind of just stumped.
Where I got stuck was creating an array that combines the location ID and date range, because they're different lengths, and I couldn't figure out how to repeat the location ID for every date in the range.
Anyway, I am 99% positive that there is a much more efficient way of doing this, and really any help at all is greatly appreciated!!
Thank you :)
Update per comment
# get the min and max dates
min_date, max_date = df[['Start.', 'End.']].stack().agg([min, max])
# create a date range
date_range = pd.date_range(min_date, max_date)
# use list comprehension to get the location of dates that are between start and end
new_df = pd.DataFrame({'Date': date_range,
                       'Location': [df[df['Start.'].le(date) & df['End.'].ge(date)]['Event.'].tolist()
                                    for date in date_range]})
# get the length of each list, which is the count
new_df['Count'] = new_df['Location'].str.len()
Date Location Count
0 2022-05-01 [Event 1] 1
1 2022-05-02 [Event 1, Event 3] 2
2 2022-05-03 [Event 1, Event 3] 2
3 2022-05-04 [Event 1, Event 2, Event 3] 3
4 2022-05-05 [Event 1, Event 2, Event 3] 3
5 2022-05-06 [Event 1, Event 2] 2
6 2022-05-07 [Event 1, Event 2] 2
7 2022-05-08 [Event 2] 1
8 2022-05-09 [Event 2] 1
9 2022-05-10 [Event 2] 1
IIUC you can try something like
# get the min and max dates
min_date, max_date = df[['Start.', 'End.']].stack().agg([min, max])
# create a date range
date_range = pd.date_range(min_date, max_date)
# use list comprehension to get the count of dates that are between start and end
# df.le is less than or equal to
# df.ge is greater than or equal to
new_df = pd.DataFrame({'Date': date_range,
                       'Count': [sum(df['Start.'].le(date) & df['End.'].ge(date))
                                 for date in date_range]})
Date Count
0 2022-05-01 1
1 2022-05-02 2
2 2022-05-03 2
3 2022-05-04 3
4 2022-05-05 3
5 2022-05-06 2
6 2022-05-07 2
7 2022-05-08 1
8 2022-05-09 1
9 2022-05-10 1
Depending on how large your date range is we may need to take a different approach as things may get slow if you have a range of two years instead of 10 days in the example.
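As a rough illustration of what such an alternative could look like (a sketch only, assuming the 'Start.' and 'End.' columns are already datetimes), the per-date counts can be computed with two binary searches instead of one boolean mask per date:
import numpy as np
import pandas as pd

# Sketch of a more scalable approach (assumes 'Start.'/'End.' are datetime columns).
starts = np.sort(df['Start.'].values)
ends = np.sort(df['End.'].values)
dates = pd.date_range(starts.min(), ends.max())

# events active on a date = events started on/before it minus events ended before it
counts = (np.searchsorted(starts, dates.values, side='right')
          - np.searchsorted(ends, dates.values, side='left'))
fast_df = pd.DataFrame({'Date': dates, 'Count': counts})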
You can also use a custom date range if you do not want to use the min and max values from the whole frame:
min_date = '2022-05-01'
max_date = '2022-05-06'
# create a date range
date_range = pd.date_range(min_date, max_date)
# use list comprehension to get the count of dates that are between start and end
new_df = pd.DataFrame({'Date': date_range,
                       'Count': [sum(df['Start.'].le(date) & df['End.'].ge(date))
                                 for date in date_range]})
Date Count
0 2022-05-01 1
1 2022-05-02 2
2 2022-05-03 2
3 2022-05-04 3
4 2022-05-05 3
5 2022-05-06 2
Note - I wanted to leave the original question up as is, and I was out of space, so I am answering my own question here, but #It_is_Chris is the real MVP.
Update! - with the enormous help from #It_is_Chris and some additional messing around, I was able to use the following code to generate the output I wanted:
# get the min and max dates
min_date, max_date = original_df[['start', 'end']].stack().agg([min, max])
# create a date range
date_range = pd.date_range(min_date, max_date)
# create location range
loc_range = original_df['location'].unique()
# create a new list that combines every date with every location
combined_list = []
for item in date_range:
    for location in loc_range:
        combined_list.append(
            {
                'Date': item,
                'location': location
            }
        )
# convert the list to a dataframe
combined_df = pd.DataFrame(combined_list)
# use merge to put original data together with the new dataframe
merged_df = pd.merge(combined_df,original_df, how="left", on="location")
# use loc to directly connect each event to a specific location and time
merged_df = merged_df.loc[(pd.to_datetime(merged_df['Date'])>=pd.to_datetime(merged_df['start'])) & (pd.to_datetime(merged_df['Date'])<=pd.to_datetime(merged_df['end']))]
# use groupby to push out a table as sought Date - Location - Count
output_merged_df = merged_df.groupby(['Date', 'location']).size()
The result looked like this:
Note - the sorting was not exactly as I have it here; I believe I would need to add some additional sorting to the dataframe before outputting it as a CSV.
Date        location  count
2022-01-01  1         1
2022-01-01  2         4
2022-01-01  3         1
2022-01-01  4         10
2022-01-01  5         3
2022-01-01  6         1
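For the additional sorting mentioned in the note above, a small sketch of what that could look like (the output file name is made up):
# flatten the grouped counts, sort by date and location, then write out
output_sorted = (output_merged_df.reset_index(name='count')
                                 .sort_values(['Date', 'location'], ignore_index=True))
output_sorted.to_csv('location_counts.csv', index=False)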
I have a dataframe representing all changes that have been made to a record over time. Among other things, this dataframe contains a record id (in this case not unique, and not meant to be, as it tracks multiple changes to the same record on a different table), startdate and enddate. Enddate is only included if it is known/preset; often it is not. I would like to map the enddate of each change record to the startdate of the next record in the dataframe with the same id.
>>> from datetime import date
>>> thing = pd.DataFrame([
... {'id':1,'startdate':date(2021,1,1),'enddate':date(2022,1,1)},
... {'id':1,'startdate':date(2021,3,24),'enddate':None},
... {'id':1,'startdate':date(2021,5,26),'enddate':None},
... {'id':2,'startdate':date(2021,2,2),'enddate':None},
... {'id':2,'startdate':date(2021,11,26),'enddate':None}
... ])
>>> thing
id startdate enddate
0 1 2021-01-01 2022-01-01
1 1 2021-03-24 None
2 1 2021-05-26 None
3 2 2021-02-02 None
4 2 2021-11-26 None
The dataframe is already sorted by the creation timestamp of the record and the id. I tried this:
thing['enddate'] = thing.groupby('id')['startdate'].apply(lambda x: x.shift())
However, the above code only maps this for around 10,000 of my 120,000 rows, the majority of which would have an enddate if I were to do this comparison by hand. Can anyone think of a better way to perform this kind of manipulation? For reference, given the dataframe above I'd like to create this one:
>>> thing
id startdate enddate
0 1 2021-01-01 2021-03-24
1 1 2021-03-24 2021-05-26
2 1 2021-05-26 None
3 2 2021-02-02 2021-11-26
4 2 2021-11-26 None
The idea is that once this transformation is done, I'll have a timeframe during which the configurations stored in the other columns (not important for this) were in place.
Here is one way to do it: use transform with the groupby to assign the values back to the rows comprising the group.
df['enddate']=df.groupby(['id'])['startdate'].transform(lambda x: x.shift(-1))
df
id startdate enddate
0 1 2021-01-01 2021-03-24
1 1 2021-03-24 2021-05-26
2 1 2021-05-26 NaT
3 2 2021-02-02 2021-11-26
4 2 2021-11-26 NaT
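If the frame might not already be ordered by startdate within each id (the question says it is sorted by the creation timestamp), sorting first would make shift(-1) pick the chronologically next record; a minimal sketch:
# Sketch: sort so that shift(-1) within each id returns the next startdate in time
df = df.sort_values(['id', 'startdate'], ignore_index=True)
df['enddate'] = df.groupby(['id'])['startdate'].transform(lambda x: x.shift(-1))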
Consider the following pandas DataFrame in Python:
ID  date                 direction
0   2022-01-03 10:00:01  IN
0   2022-01-03 11:00:01  IN
0   2022-01-03 11:10:01  OUT
0   2022-01-03 12:00:03  IN
0   2022-01-03 14:32:01  OUT
1   2022-01-03 10:32:01  OUT
1   2022-01-04 11:32:01  IN
1   2022-01-04 14:32:01  OUT
2   2022-01-02 08:00:01  OUT
2   2022-01-02 08:02:01  IN
I need to get the check-in and check-out for each group by ID, considering that an entry_date is registered with direction = IN and an exit_date with direction = OUT. The idea is to create the records as follows:
ID  entry_date           exit_date
0   2022-01-03 10:00:01  2022-01-03 11:10:01
0   2022-01-03 12:00:03  2022-01-03 14:32:01
1   2022-01-04 11:32:01  2022-01-04 14:32:01
A record is only created if a row with IN appears first and is later closed by a row with OUT in the direction column (there may be more rows in between with direction IN, as can be seen in the example).
I hope you can help me, thank you in advance.
I'm not sure I completely understand, but here's an attempt.
After sorting along the date column (date should be datetime): group on ID and then eliminate, in three steps, all rows that don't fit the required pattern: 1. eliminate OUT blocks at the beginning of every group; 2. reduce consecutive blocks of INs/OUTs in each group to their first row; 3. remove the last row of the group if it is an IN row. The rest is essentially splitting the resulting dataframe into two interlaced parts and concatenating them next to one another.
def connect(df):
    df = df[df["direction"].cummax()]
    df = df[df["direction"].diff().fillna(True)]
    if df.shape[0] % 2:
        df = df.iloc[:-1]
    return df
result = (
    df
    .assign(direction=df["direction"].replace({"IN": True, "OUT": False}))
    .sort_values(["ID", "date"])
    .groupby("ID").apply(connect)
    .drop(columns="direction")
    .droplevel(-1, axis=0)
)
result = pd.concat(
    [result.iloc[0::2], result[["date"]].iloc[1::2]], axis=1
).reset_index(drop=True)
result.columns = ["ID", "entry_date", "exit_date"]
Result:
ID entry_date exit_date
0 0 2022-01-03 10:00:01 2022-01-03 11:10:01
1 0 2022-01-03 12:00:03 2022-01-03 14:32:01
2 1 2022-01-04 11:32:01 2022-01-04 14:32:01
Another solution would be to group by ID and then, in each group, split the directions into two stacks, the INs and the OUTs, and run through them to gather the connected pairs:
def connect(df):
    mask = df["direction"].eq("IN")
    ins = list(reversed(df.loc[mask, "date"].values))
    outs = list(reversed(df.loc[~mask, "date"].values))
    connects = []
    if ins:
        dt_out = ins[-1] + pd.Timedelta(days=-1)
    while ins and outs:
        dt_in = None
        while ins:
            dt = ins.pop()
            if dt > dt_out:
                dt_in = dt
                break
        while dt_in and outs:
            dt_out = outs.pop()
            if dt_in < dt_out:
                connects.append([dt_in, dt_out])
                break
    return pd.DataFrame(
        connects, columns=["entry_date", "exit_date"]
    )
result = (
    df.sort_values(["ID", "date"]).groupby("ID").apply(connect)
      .droplevel(1, axis=0).reset_index()
)
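Running this on the example data should give the same result as the first approach:
print(result)
#    ID           entry_date            exit_date
# 0   0  2022-01-03 10:00:01  2022-01-03 11:10:01
# 1   0  2022-01-03 12:00:03  2022-01-03 14:32:01
# 2   1  2022-01-04 11:32:01  2022-01-04 14:32:01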
I'd like to mark duplicate records within my data.
A duplicate is defined as follows: for each person_key, any repeat record within 25 days of the first is a duplicate. However, after 25 days the count is reset.
So if I had records for a person_key on Day 0, Day 20, Day 26, Day 30, the first and third would be kept, as the third is more than 25 days from the first. The second and fourth are marked as duplicate, since they are within 25 days from the "first of their group".
In other words, I think I need to identify groups of 25-day blocks, and then "dedupe" within those blocks. I'm struggling to get started with creating these initial groups.
I will eventually have to apply this to 5m records, so I am trying to steer clear of pd.DataFrame.apply.
person_key date duplicate
A 2019-01-01 False
B 2019-02-01 False
B 2019-02-12 True
C 2019-03-01 False
A 2019-01-10 True
A 2019-01-26 False
A 2019-01-28 True
A 2019-02-10 True
A 2019-04-01 False
Thanks for your help!
Group the dataframe by person_key; then, for each person_key group, group again by a custom Grouper with a frequency of 25 days and use cumcount to create a sequential counter per subgroup. Comparing this counter with 0 identifies the duplicate rows.
def find_dupes():
    for _, g in df.groupby('person_key', sort=False):
        # yield one boolean Series per person_key, keeping the original row index
        yield g.groupby(pd.Grouper(key='date', freq='25D')).cumcount().ne(0)

# concat preserves the original index, so the flags align back to the right rows
df['duplicate'] = pd.concat(find_dupes())
Result
print(df)
person_key date duplicate
0 A 2019-01-01 False
1 B 2019-02-01 False
2 B 2019-02-12 True
3 C 2019-03-01 False
4 A 2019-01-10 True
5 A 2019-01-26 False
6 A 2019-01-28 True
7 A 2019-02-10 True
8 A 2019-04-01 False
I'm having trouble finding an efficient way to update some column values in a large pandas DataFrame.
The code below creates a DataFrame in a similar format to what I'm working with. A summary of the data: the DataFrame contains three days of consumption data, with each day split into 10 measurement periods. Each measurement period is also recorded during four separate processes: a preliminary reading, an end-of-day reading, and two later revisions, with all updates recorded in the Last_Update column with the date.
import numpy as np
import pandas as pd

dates = ['2022-01-01']*40 + ['2022-01-02']*40 + ['2022-01-03']*40
periods = list(range(1,11))*12
versions = (['PRELIM'] * 10 + ['DAILY'] * 10 + ['REVISE'] * 20) * 3
data = {'Date': dates,
        'Period': periods,
        'Version': versions,
        'Consumption': np.random.randint(1, 30, 120)}
df = pd.DataFrame(data)
df.Date = pd.to_datetime(df.Date)
## Add random times to the REVISE Last_Update values
df['Last_Update'] = df['Date'].apply(lambda x: x + pd.Timedelta(hours=np.random.randint(1,23), minutes=np.random.randint(1,59)))
df['Last_Update'] = df['Last_Update'].where(df.Version == 'REVISE', df['Date'])
The problem is that the two revision categories are both specified by the same value: "REVISE". One of these "REVISE" values must be changed to something like "REVISE_2". If you group the data in the following way df.groupby(['Date', 'Period', 'Version', 'Last_Update'])['Consumption'].sum() you can see there are two Last_Update dates for each period in each day for REVISE. So we need to set the REVISE with the largest date to REVISE_2.
The only way I've managed to find a solution is using a very convoluted function with the apply method to test which date is larger, store its index, and then change the value using loc. This ended up taking a huge amount of time even for small segments of the data (the full dataset is millions of rows).
I feel like there is an easy solution using groupby functions, but I'm having difficulties navigating the multi-index output.
Any help would be appreciated cheers.
We figure out the index of the max REVISE date using idxmax after some grouping, and then change the labels:
last_revised_date_idx = df[df['Version'] == 'REVISE'].groupby(['Date', 'Period'], group_keys = False)['Last_Update'].idxmax()
df.loc[last_revised_date_idx, 'Version'] = 'REVISE_2'
check the output:
df.groupby(['Date', 'Period', 'Version', 'Last_Update'])['Consumption'].count().head(20)
produces
Date Period Version Last_Update
2022-01-01 1 DAILY 2022-01-01 00:00:00 1
PRELIM 2022-01-01 00:00:00 1
REVISE 2022-01-01 03:50:00 1
REVISE_2 2022-01-01 12:10:00 1
2 DAILY 2022-01-01 00:00:00 1
PRELIM 2022-01-01 00:00:00 1
REVISE 2022-01-01 10:45:00 1
REVISE_2 2022-01-01 22:05:00 1
3 DAILY 2022-01-01 00:00:00 1
PRELIM 2022-01-01 00:00:00 1
REVISE 2022-01-01 17:03:00 1
REVISE_2 2022-01-01 19:10:00 1
4 DAILY 2022-01-01 00:00:00 1
PRELIM 2022-01-01 00:00:00 1
REVISE 2022-01-01 15:23:00 1
REVISE_2 2022-01-01 18:08:00 1
5 DAILY 2022-01-01 00:00:00 1
PRELIM 2022-01-01 00:00:00 1
REVISE 2022-01-01 12:19:00 1
REVISE_2 2022-01-01 18:04:00 1
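As a quick sanity check (a sketch, assuming the code above has been run), each version label should now appear once per (Date, Period) pair, i.e. 30 times each:
# After the relabelling, REVISE and REVISE_2 should each cover one revision per period
print(df['Version'].value_counts())
# Expected: PRELIM, DAILY, REVISE and REVISE_2 each appearing 30 times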