I'd like to mark duplicate records within my data.
A duplicate is defined as: for each person_key, any record that falls within 25 days of the first record of its group. After 25 days, the count resets and the next record starts a new group.
So if I had records for a person_key on Day 0, Day 20, Day 26, and Day 30, the first and third would be kept, since the third is more than 25 days from the first. The second and fourth are marked as duplicates, since each is within 25 days of the "first of its group".
In other words, I think I need to identify groups of 25-day blocks, and then "dedupe" within those blocks. I'm struggling with how to create these initial groups.
I will eventually have to apply this to 5m records, so I am trying to steer clear of pd.DataFrame.apply.
person_key date duplicate
A 2019-01-01 False
B 2019-02-01 False
B 2019-02-12 True
C 2019-03-01 False
A 2019-01-10 True
A 2019-01-26 False
A 2019-01-28 True
A 2019-02-10 True
A 2019-04-01 False
Thanks for your help!
Group the dataframe by person_key; then, within each person's group, group again with a custom Grouper with a frequency of 25 days and use cumcount to create a sequential counter per subgroup. Comparing this counter with 0 identifies the duplicate rows:
def find_dupes():
    for _, g in df.groupby('person_key', sort=False):
        # bin each person's dates into 25-day windows and flag
        # every row after the first within each window
        yield g.groupby(pd.Grouper(key='date', freq='25D')).cumcount().ne(0)

# concat aligns on the original index, restoring the original row order
df['duplicate'] = pd.concat(find_dupes())
Result
print(df)
person_key date duplicate
0 A 2019-01-01 False
1 B 2019-02-01 False
2 B 2019-02-12 True
3 C 2019-03-01 False
4 A 2019-01-10 True
5 A 2019-01-26 False
6 A 2019-01-28 True
7 A 2019-02-10 True
8 A 2019-04-01 False
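For reference, a self-contained version that reconstructs the sample frame and applies the approach. The per-group results are yielded as Series and re-aligned with pd.concat, so assignment follows the original index rather than group order:

```python
import pandas as pd

# reconstruct the sample frame from the question
df = pd.DataFrame({
    'person_key': ['A', 'B', 'B', 'C', 'A', 'A', 'A', 'A', 'A'],
    'date': pd.to_datetime(['2019-01-01', '2019-02-01', '2019-02-12',
                            '2019-03-01', '2019-01-10', '2019-01-26',
                            '2019-01-28', '2019-02-10', '2019-04-01']),
})

def find_dupes():
    for _, g in df.groupby('person_key', sort=False):
        # 25-day bins are anchored at each person's first date;
        # cumcount() is 0 for the first row in a bin, so ne(0) flags the rest
        yield g.groupby(pd.Grouper(key='date', freq='25D')).cumcount().ne(0)

# concat preserves the original index, so assignment aligns row-by-row
df['duplicate'] = pd.concat(find_dupes())
```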
What I have and am trying to do:
A dataframe, with headers: event_id, location_id, start_date, end_date.
An event can only have one location, start and end.
A location can have multiple events, starts and ends, and they can overlap.
The goal here is to be able to say, given any time T, for location X, how many events were there?
E.g.
Given three events, all for location 2:
Event.    Start.      End.
Event 1   2022-05-01  2022-05-07
Event 2   2022-05-04  2022-05-10
Event 3   2022-05-02  2022-05-05

Time T.     Count of Events
2022-05-01  1
2022-05-02  2
2022-05-03  2
2022-05-04  3
2022-05-05  3
2022-05-06  2
**What I have tried so far, but got stuck on:**
(I did look at THIS possible solution for a similar problem, and I went pretty far with it, but I got lost in the iterrows and how to apply that here.)
Try to get an array or dataframe that has a 365 day date range for each location ID.
E.g.
[1,2022-01-01],[1,2022-01-02]........[98,2022-01-01][98,2022-01-02]
Then convert that array to a dataframe, and merge it with the original dataframe like:
index  location  time        event  location2  start       end
0      1         2022-01-01  1      10         2022-11-07  2022-11-12
1      1         2022-01-01  2      4          2022-02-16  2022-03-05
2      1         2022-01-01  3      99         2022-06-10  2022-06-15
3      1         2022-01-01  4      2          2021-12-31  2022-01-05
4      1         2022-01-01  5      5          2022-05-08  2022-05-22
Then perform some kind of reduction that returns the count:
location  Time        Count
1         2022-01-01  10
1         2022-01-02  3
1         2022-01-03  13
...       ...         ...
99        2022-01-01  4
99        2022-01-02  0
99        2022-01-03  7
99        2022-01-04  12
I've done something similar, tying events to other events where their dates overlapped, using .loc[...], but I don't think that would work here, and I'm kind of just stumped.
Where I got stuck was creating an array that combines the location ID and the date range, because they're different lengths, and I couldn't figure out how to repeat the location ID for every date in the range.
Anyway, I am 99% positive that there is a much more efficient way of doing this, and really any help at all is greatly appreciated!
Thank you :)
Update per comment
# get the min and max dates
min_date, max_date = df[['Start.', 'End.']].stack().agg([min, max])
# create a date range
date_range = pd.date_range(min_date, max_date)
# use list comprehension to get the location of dates that are between start and end
new_df = pd.DataFrame({'Date': date_range,
'Location': [df[df['Start.'].le(date) & df['End.'].ge(date)]['Event.'].tolist()
for date in date_range]})
# get the length of each list, which is the count
new_df['Count'] = new_df['Location'].str.len()
Date Location Count
0 2022-05-01 [Event 1] 1
1 2022-05-02 [Event 1, Event 3] 2
2 2022-05-03 [Event 1, Event 3] 2
3 2022-05-04 [Event 1, Event 2, Event 3] 3
4 2022-05-05 [Event 1, Event 2, Event 3] 3
5 2022-05-06 [Event 1, Event 2] 2
6 2022-05-07 [Event 1, Event 2] 2
7 2022-05-08 [Event 2] 1
8 2022-05-09 [Event 2] 1
9 2022-05-10 [Event 2] 1
IIUC you can try something like
# get the min and max dates
min_date, max_date = df[['Start.', 'End.']].stack().agg([min, max])
# create a date range
date_range = pd.date_range(min_date, max_date)
# use list comprehension to get the count of dates that are between start and end
# df.le is less than or equal to
# df.ge is greater than or equal to
new_df = pd.DataFrame({'Date': date_range,
'Count': [sum(df['Start.'].le(date) & df['End.'].ge(date))
for date in date_range]})
Date Count
0 2022-05-01 1
1 2022-05-02 2
2 2022-05-03 2
3 2022-05-04 3
4 2022-05-05 3
5 2022-05-06 2
6 2022-05-07 2
7 2022-05-08 1
8 2022-05-09 1
9 2022-05-10 1
Depending on how large your date range is, we may need to take a different approach, as things may get slow if you have a range of two years instead of the 10 days in the example.
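For long ranges, the per-date list comprehension is roughly O(days x events). A sweep-line sketch avoids that: add +1 at each start, -1 the day after each end, take a cumulative sum, then forward-fill onto the daily range (column names follow the example above):

```python
import pandas as pd

# the three events from the example
df = pd.DataFrame({
    'Start.': pd.to_datetime(['2022-05-01', '2022-05-04', '2022-05-02']),
    'End.':   pd.to_datetime(['2022-05-07', '2022-05-10', '2022-05-05']),
})

# +1 when an event starts, -1 the day after it ends
starts = pd.Series(1, index=df['Start.'])
ends = pd.Series(-1, index=df['End.'] + pd.Timedelta(days=1))

# net change per day, then a running total = events active that day
changes = pd.concat([starts, ends]).groupby(level=0).sum()
counts = changes.cumsum()

# forward-fill onto a continuous daily index
daily = counts.reindex(pd.date_range(df['Start.'].min(), df['End.'].max()),
                       method='ffill')
```

This touches each event twice regardless of how long the range is, so it scales to years of data.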
You can also use a custom date range if you do not want to use the min and max values from the whole frame:
min_date = '2022-05-01'
max_date = '2022-05-06'
# create a date range
date_range = pd.date_range(min_date, max_date)
# use list comprehension to get the count of dates that are between start and end
new_df = pd.DataFrame({'Date': date_range,
'Count': [sum(df['Start.'].le(date) & df['End.'].ge(date))
for date in date_range]})
Date Count
0 2022-05-01 1
1 2022-05-02 2
2 2022-05-03 2
3 2022-05-04 3
4 2022-05-05 3
5 2022-05-06 2
Note - I wanted to leave the original question up as is, and I was out of space, so I am answering my own question here, but @It_is_Chris is the real MVP.
Update! - with the enormous help from @It_is_Chris and some additional messing around, I was able to use the following code to generate the output I wanted:
# get the min and max dates
min_date, max_date = original_df[['start', 'end']].stack().agg([min, max])
# create a date range
date_range = pd.date_range(min_date, max_date)
# create location range
loc_range = original_df['location'].unique()
# create a new list that combines every date with every location
combined_list = []
for item in date_range:
for location in loc_range:
combined_list.append(
{
'Date':item,
'location':location
}
)
# convert the list to a dataframe
combined_df = pd.DataFrame(combined_list)
# use merge to put original data together with the new dataframe
merged_df = pd.merge(combined_df,original_df, how="left", on="location")
# use loc to directly connect each event to a specific location and time
merged_df = merged_df.loc[(pd.to_datetime(merged_df['Date'])>=pd.to_datetime(merged_df['start'])) & (pd.to_datetime(merged_df['Date'])<=pd.to_datetime(merged_df['end']))]
# use groupby to push out a table as sought Date - Location - Count
output_merged_df = merged_df.groupby(['Date','location']).size()
The result looked like this:
Note - the sorting was not as I have it here; I believe I would need to add some additional sorting to the dataframe before outputting it as a CSV.
Date        location  count
2022-01-01  1         1
2022-01-01  2         4
2022-01-01  3         1
2022-01-01  4         10
2022-01-01  5         3
2022-01-01  6         1
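Note that the cross join above materializes dates x locations x events rows before filtering, which can get heavy at scale. A leaner sketch (using a small hypothetical frame in the same location/start/end shape) expands each event only over its own active days with explode:

```python
import pandas as pd

# small hypothetical frame in the same shape as original_df
original_df = pd.DataFrame({
    'location': [1, 1, 2],
    'start': pd.to_datetime(['2022-01-01', '2022-01-02', '2022-01-01']),
    'end':   pd.to_datetime(['2022-01-03', '2022-01-02', '2022-01-02']),
})

# one row per (event, active day): build each event's date range, then explode
expanded = (original_df
            .assign(Date=[pd.date_range(s, e) for s, e in
                          zip(original_df['start'], original_df['end'])])
            .explode('Date'))

# Date - location - count, already sorted by the groupby keys
counts = (expanded.groupby(['Date', 'location'])
          .size()
          .rename('count')
          .reset_index())
```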
There are segments of readings that have faulty data, and I want to remove entire days that contain at least one. I have already created a column with True/False indicating whether a segment is faulty.
An example of the dataframe is below, since the full one has more than 100k rows:
power_c power_g temperature to_delete
date_time
2019-01-01 00:00:00+00:00 2985 0 10.1 False
2019-01-01 00:05:00+00:00 2258 0 10.1 True
2019-01-01 01:00:00+00:00 2266 0 10.1 False
2019-01-02 00:15:00+00:00 3016 0 10.0 False
2019-01-03 01:20:00+00:00 2265 0 10.0 True
For example, the first and second rows belong to the same hour on the same day; one of the values is True, so I want to delete all rows of that day.
Data always comes in 5-minute intervals, so I tried deleting the 288 items after each True, but since the error is not at the start of the day, that does not work as intended.
I am very new to programming and have tried a lot of different answers everywhere; I would appreciate any help very much.
Group by the date, then filter out groups that have at least one to_delete.
(df
.groupby(df.index.date)
.apply(lambda sf: None if sf['to_delete'].any() else sf)
.reset_index(level=0, drop=True))
power_c power_g temperature to_delete
date_time
2019-01-02 00:15:00+00:00 3016 0 10.0 False
I'm assuming date_time is a datetime type. If not, convert it first:
df.index = pd.to_datetime(df.index)
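On 100k+ rows, a vectorized alternative to the groupby/apply above may be noticeably faster: build a per-day "any faulty reading" mask with transform and filter with it. A sketch assuming the same datetime index and to_delete column:

```python
import pandas as pd

df = pd.DataFrame(
    {'power_c': [2985, 2258, 2266, 3016, 2265],
     'power_g': [0, 0, 0, 0, 0],
     'temperature': [10.1, 10.1, 10.1, 10.0, 10.0],
     'to_delete': [False, True, False, False, True]},
    index=pd.to_datetime(['2019-01-01 00:00', '2019-01-01 00:05',
                          '2019-01-01 01:00', '2019-01-02 00:15',
                          '2019-01-03 01:20']),
)

# True for every row whose calendar day contains at least one faulty reading
bad_day = df['to_delete'].groupby(df.index.normalize()).transform('any')
clean = df[~bad_day]
```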
I have a column that contains Friday-to-Friday dates, e.g. Fri March 4 to Fri March 11. I only want to filter out the earliest Friday date. Any suggestions? I figured out a way to sort out the min value, but I feel like there's a better method:
df['Submitted On'] = pd.to_datetime(df['Submitted On'])
early = df['Submitted On'].min()
df = df.loc[df['Submitted On'] != early]
Although I don't know the use case for your data, your method is a little brittle. If for some reason the range of dates in your column changes, then you're filtering out the earliest date regardless of whether it's a Friday or not.
You can use the .dt.dayofweek method for Series which will return integers 0 through 6 for the day of the week meaning Friday is 4, and filter based on the first occurrence of a Friday. For example:
df = pd.DataFrame({'Submitted On': pd.date_range('2022-03-04','2022-03-11'), 'value':range(8)})
df['Submitted On'] = pd.to_datetime(df['Submitted On'])
filtered_df = df.drop(labels=df[df['Submitted On'].dt.dayofweek == 4].index.values[0])
Result:
Submitted On value
1 2022-03-05 1
2 2022-03-06 2
3 2022-03-07 3
4 2022-03-08 4
5 2022-03-09 5
6 2022-03-10 6
7 2022-03-11 7
And note that if I change the date range slightly, it still drops the first Friday:
df = pd.DataFrame({'Submitted On': pd.date_range('2022-03-03','2022-03-12'), 'value':range(10)})
filtered_df = df.drop(labels=df[df['Submitted On'].dt.dayofweek == 4].index.values[0])
Result:
Submitted On value
0 2022-03-03 0
2 2022-03-05 2
3 2022-03-06 3
4 2022-03-07 4
5 2022-03-08 5
6 2022-03-09 6
7 2022-03-10 7
8 2022-03-11 8
9 2022-03-12 9
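One small caveat with index.values[0]: it raises an IndexError when the frame contains no Friday at all. A guarded variant of the same idea:

```python
import pandas as pd

df = pd.DataFrame({'Submitted On': pd.date_range('2022-03-04', '2022-03-11'),
                   'value': range(8)})

# index labels of all Fridays, in order
fridays = df.index[df['Submitted On'].dt.dayofweek == 4]

# drop the first Friday only if one exists
filtered_df = df.drop(labels=fridays[0]) if len(fridays) else df
```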
I am looking to determine the count of string variables in a column across a 3 month data sample. Samples were taken at random times throughout each day. I can group the data by hour, but I require the fidelity of 30-minute intervals (e.g. 0500-0530, 0530-0600) on roughly 10k rows of data.
An example of the data:
datetime stringvalues
2018-06-06 17:00 A
2018-06-07 17:30 B
2018-06-07 17:33 A
2018-06-08 19:00 B
2018-06-09 05:27 A
I have tried setting the datetime column as the index, but I cannot figure out how to group the data on anything other than 'hour', and I don't have fidelity on the string value count:
df['datetime'] = pd.to_datetime(df['datetime'])
df.index = df['datetime']
df.groupby(df.index.hour).count()
Which returns an output similar to:
datetime stringvalues
datetime
5 0 0
6 2 2
7 5 5
8 1 1
...
I researched multi-indexing and resampling to some length the past two days but I have been unable to find a similar question. The desired result would look something like this:
datetime A B
0500 1 2
0530 3 5
0600 4 6
0630 2 0
....
There is no straightforward way to do a TimeGrouper on the time component, so we do this in two steps:
v = (df.groupby([pd.Grouper(key='datetime', freq='30min'), 'stringvalues'])
.size()
.unstack(fill_value=0))
v.groupby(v.index.time).sum()
stringvalues A B
05:00:00 1 0
17:00:00 1 0
17:30:00 1 1
19:00:00 0 1
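A self-contained run with the sample data (assuming the datetime column has been parsed to datetime dtype):

```python
import pandas as pd

df = pd.DataFrame({
    'datetime': pd.to_datetime(['2018-06-06 17:00', '2018-06-07 17:30',
                                '2018-06-07 17:33', '2018-06-08 19:00',
                                '2018-06-09 05:27']),
    'stringvalues': ['A', 'B', 'A', 'B', 'A'],
})

# step 1: count values per 30-minute calendar bin
v = (df.groupby([pd.Grouper(key='datetime', freq='30min'), 'stringvalues'])
       .size()
       .unstack(fill_value=0))

# step 2: collapse dates, keeping only the time-of-day of each bin
result = v.groupby(v.index.time).sum()
```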
I would like to count how many unique weekdays exist in a timestamp column. Here's an input; I want the output to be 4 (since 8/5 and 8/6 are weekend days).
captureTime
0 8/1/2017 0:05
1 8/2/2017 0:05
2 8/3/2017 0:05
3 8/4/2017 0:05
4 8/5/2017 0:05
5 8/6/2017 0:05
Using np.is_busday:
import numpy as np
import pandas as pd
df = pd.DataFrame( {
'captureTime':[ '8/1/2017 0:05', '8/2/2017 0:05', '8/3/2017 0:05',
'8/4/2017 0:05', '8/5/2017 0:05', '8/6/2017 0:05']})
df['captureTime'] = pd.to_datetime(df['captureTime'])
print(np.is_busday(df['captureTime'].values.astype('datetime64[D]')).sum())
prints
4
Above, all business days are counted once.
If you wish to count identical datetimes only once, you could use
np.is_busday(df['captureTime'].unique().astype('datetime64[D]')).sum()
Or, if you wish to remove datetimes that have identical date components, convert to datetime64[D] dtype before calling np.unique:
np.is_busday(np.unique(df['captureTime'].values.astype('datetime64[D]'))).sum()
One way is the pandas Series.dt.weekday accessor:
df['captureTime'] = pd.to_datetime(df['captureTime'])
np.sum(df['captureTime'].dt.weekday.isin([0,1,2,3,4]))
It returns 4
You can use boolean indexing in case you need to capture the dates
df[df['captureTime'].dt.weekday.isin([0,1,2,3,4])]
captureTime
0 2017-08-01 00:05:00
1 2017-08-02 00:05:00
2 2017-08-03 00:05:00
3 2017-08-04 00:05:00
Convert to datetime using pd.to_datetime, get the unique dayofweek values, and count those under 5.
out = (df.captureTime.apply(pd.to_datetime).dt.dayofweek.unique() < 5).sum()
print(out)
4
Series.unique removes duplicates, leaving a unique array of days of the week, in which we count the occurrences under 5 (0-4 are weekdays).
Output of df.dayofweek:
out = df.captureTime.apply(pd.to_datetime).dt.dayofweek
print(out)
0 1
1 2
2 3
3 4
4 5
5 6
Name: captureTime, dtype: int64
Assuming you have captureTime as a datetime object, you can do this:
s = df['captureTime'].dt.weekday
s[s < 5].count()  # 0-4 correspond to Monday-Friday; 5, 6 are Saturday, Sunday