Drop rows from each group if dates are within a given range - python

Given a DataFrame like below:
dfx = pd.DataFrame({"ID":["A", "A", "C" ,"B", "B"],
"date":["01/01/2014","01/31/2014","01/23/2014","01/01/2014","01/20/2014"]})
I want to remove "duplicates". "duplicates" are defined as those rows where the ID of the rows are the same, but the "date" between them is Less Than 30 days.
The resulting DataFrame upon removal of the "duplicates" is expected appear as:
ID date
A 01/01/2014
A 01/31/2014
C 01/23/2014
B 01/01/2014

Convert date to datetime.
Group date by ID and find the difference between consecutive rows.
Extract the days component from the timedelta difference and compare it to 30.
Filter dfx based on the mask:
dfx[~pd.to_datetime(dfx.date).groupby(dfx.ID).diff().dt.days.lt(30)]
ID date
0 A 01/01/2014
1 A 01/31/2014
2 C 01/23/2014
3 B 01/01/2014
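The same filter can be spelled out step by step; a minimal sketch of the steps listed above, equivalent to the one-liner:
import pandas as pd

dfx = pd.DataFrame({"ID": ["A", "A", "C", "B", "B"],
                    "date": ["01/01/2014", "01/31/2014", "01/23/2014", "01/01/2014", "01/20/2014"]})

dates = pd.to_datetime(dfx.date)          # 1. convert to datetime
gap = dates.groupby(dfx.ID).diff()        # 2. difference to the previous row within each ID
mask = gap.dt.days.lt(30)                 # 3. True where the gap is less than 30 days
result = dfx[~mask]                       # 4. keep rows that are not "duplicates"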

Related

how to extract a subset of a dataframe where the date column is larger than a date?

I have a dataframe consisting of a date column and other columns.
As a sample, see below:
a = pd.DataFrame({'Date':['2021/2/21', '2021/2/20','2021/3/5','2021/5/30'],
'Number':[2,4,6,9]})
a
Date Number
0 2021/2/21 2
1 2021/2/20 4
2 2021/3/5 6
3 2021/5/30 9
a['Date'].dtypes
Object
neither of the following got me the subset
a = a[a['Date'] > '20/02/2021']
[x for x in a['Date'] if x > '20/02/2021' ]
how can I get the subset?
Use pd.to_datetime to standardize the date column:
a['Date'] = pd.to_datetime(a.Date)
Then compare using ge, i.e. greater than or equal to:
a['Date'].ge('2021/02/21')
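ge only returns a boolean mask; to get the subset itself, index the frame with it (a minimal sketch following the answer above):
mask = a['Date'].ge('2021/02/21')
subset = a[mask]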

find max-min values for one column based on another

I have a dataset that looks like this.
datetime id
2020-01-22 11:57:09.286 UTC 5
2020-01-22 11:57:02.303 UTC 6
2020-01-22 11:59:02.303 UTC 5
Ids are not unique and give different datetime values. Let's say:
duration = max(datetime)-min(datetime).
I want to count the ids for which the duration max(datetime)-min(datetime) is less than 2 seconds. So, for example, I will output:
count = 1
because of id 5. Then, I want to create a new dataset which contains only those rows with the min(datetime) value for each of the unique ids. So, the new dataset will contain the first row but not the third. The final data set should not have any duplicate ids.
datetime id
2020-01-22 11:57:09.286 UTC 5
2020-01-22 11:57:02.303 UTC 6
How can I do any of these?
P.S.: The dataset I provided might not be a good example, since the condition is 2 seconds but the differences here are in minutes.
Do you want this? :
import pandas as pd
from datetime import timedelta

df.datetime = pd.to_datetime(df.datetime)

c = 0
def count(x):
    global c
    x = x.sort_values('datetime')
    # if the id appears more than once, check the max-min duration
    if len(x) > 1:
        diff = x.iloc[-1]['datetime'] - x.iloc[0]['datetime']
        if diff < timedelta(seconds=2):
            c += 1
    # keep only the row with the earliest datetime for this id
    return x.head(1)

new_df = df.groupby('id').apply(count).reset_index(drop=True)
Now, if you print c it'll show the count which is 1 for this case and new_df will hold the final dataframe.
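For comparison, the same counts can be obtained without a global counter; a sketch (not from the answer above) using plain groupby aggregation, assuming df already has the converted 'datetime' column and an 'id' column:
span = df.groupby('id')['datetime'].agg(lambda s: s.max() - s.min())       # duration per id
count = (span < pd.Timedelta(seconds=2)).sum()                             # ids whose duration is under 2 seconds
new_df = df.sort_values('datetime').groupby('id', as_index=False).first()  # earliest row per id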

Select a specific group of a grouped dataframe with pandas

I have the following dataframe:
df.index = df['Date']
df.groupby([df.index.month, df['Category']])['Amount'].sum()
Date Category Amount
1 A -125.35
B -40.00
...
12 A 505.15
B -209.00
I would like to report the sum of the Amount for every Category B like:
Date Category Amount
1 B -40.00
...
12 B -209.00
I tried the df.get_group method, but this method needs a tuple that contains the Date and Category key. Is there a way to filter out only the Categories with B?
You can use IndexSlice:
# groupby here (double brackets keep Amount as a DataFrame, so .loc and .query below work)
df_group = df.groupby([df.index.month, df['Category']])[['Amount']].sum()
# report only Category B
df_group.loc[pd.IndexSlice[:, 'B'], :]
Or query:
# query works with index level name too
df_group.query('Category=="B"')
Output:
Amount
Date Category
1 B -40.0
12 B -209.0
apply a filter to your groupby dataframe where Category equals B
# filter first, then group, so the grouping keys match the filtered length
mask = df['Category'] == 'B'
df_b = df[mask]
df_b.groupby([df_b.index.month, df_b['Category']])['Amount'].sum()
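Another option, not shown in the answers above, is DataFrame.xs, which selects a cross-section from one level of the MultiIndex; a minimal sketch against df_group from the first answer:
df_group.xs('B', level='Category', drop_level=False)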

Finding consecutive days in the pandas dataframe

I have a dataframe:
ColA ColB
0 A 1/2/2020
1 A 1/3/2020
2 A 1/4/2020
3 A 1/10/2020
4 B 1/3/2020
5 B 1/19/2020
6 C 1/2/2020
7 C 1/7/2020
8 D 1/8/2020
Now I want to find out the name of the series in colA which has three consecutive days in colB.
Output:
the answer would be A since it has 1/2/2020, 1/3/2020 and 1/4/2020 in colB.
A general approach would be like this:
# 1. To make sure the dates are sorted
df = df.sort_values(["ColA", "ColB"])
# 2. Standardize the dates by offseting them
df["ColB_std"] = df["ColB"] - pd.to_timedelta(range(df.shape[0]), 'day')
# 3. Counting each instance of ColA and standardized date
s = df.groupby(["ColA", "ColB_std"])["ColB_std"].count()
# 4. Getting elements from ColA that have at least 1 sequence of at least length 3
colA = s[ s >= 3 ].index.get_level_values(0).unique().values
# 5. Filtering the dataframe
df[ df["ColA"].isin(colA) ]
You want ColAs with 3 consecutive dates. Or you could think of it as wanting ColAs where there's a sequence of date, date + 1 day and date + 2 days. By sorting the dataframe by ColA and ColB (1), we know that in the case you want to check, date + 1 day will always follow date, and date + 2 days will be the one following that.
With this, you can standardize the dates by subtracting n days corresponding to their row. So, the sequence of date, date + 1 day and date + 2 days becomes date, date and date (2).
Now that the date column is standardized, we just need to count how many rows exist for each pair ('ColA', 'ColB_std') (3), get the elements from ColA that have counts of 3 or more (4), and filter the dataframe (5).
However, this doesn't support duplicated pairs of ('ColA', 'ColB'); for that, you'd need to do this first:
df2 = df.drop_duplicates(["ColA", "ColB"])
Then proceed to use this df2 in steps 1, 2, 3 and 4, and in the end filter the real df in step 5.
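A minimal sketch of that duplicate-safe variant (assuming ColB already holds datetimes):
df2 = df.drop_duplicates(["ColA", "ColB"]).sort_values(["ColA", "ColB"])     # 1. dedupe and sort
df2["ColB_std"] = df2["ColB"] - pd.to_timedelta(range(df2.shape[0]), 'day')  # 2. standardize dates
s = df2.groupby(["ColA", "ColB_std"])["ColB_std"].count()                    # 3. count each pair
colA = s[s >= 3].index.get_level_values(0).unique().values                   # 4. ColAs with a run of 3+
df[df["ColA"].isin(colA)]                                                    # 5. filter the original df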
Previously, I answered that you could also do it like this:
# To make sure the dates are sorted
df = df.sort_values(["ColA", "ColB"])
# Calculating the difference between dates inside each group
s = df.groupby("ColA")["ColB"].diff().dt.days
# Filtering the dataframe
df[ ((s == 1) & (s.shift(1) == 1)).groupby(df["ColA"]).transform("any") ]
The idea is that in s, the difference is always between the previous date and the current date. However, a single difference of 1 day only guarantees 2 consecutive dates, not 3. By shifting the series by 1, you make sure that both the current difference and the previous one are 1 [ (s == 1) & (s.shift(1) == 1) ].
After that, I just groupby(df["ColA"]) and check whether any element inside the group is True with transform("any").
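Note that both approaches assume ColB already holds datetimes; with the sample data stored as strings, a conversion would come first (a one-line sketch):
df["ColB"] = pd.to_datetime(df["ColB"], format="%m/%d/%Y")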

How to group a dataframe by date to get an array of ids for each group?

Here is my dataframe:
id - title - publish_up - date
1 - Exampl- 2019-12-1 - datetime
...
I created a date column by applying
df['date'] = pd.to_datetime(df['publish_up'], format='%Y-%m-%d')
I am new to Python and I am trying to learn pandas.
What I would like to do is to create groups for each day of the year.
The dataframe contains data from one year span, so in theory, there should be 365 groups.
Then, I would need to get an array of ids for each group.
example:
[{date:'2019-12-1',ids:[1,2,3,4,5,6]},{date:'2019-12-2',ids:[7,8,9,10,11,12,13,14]},...]
Thank you
If you want the dates formatted as strings in the output list, converting them to datetimes is not necessary. Just create lists per group with GroupBy.agg, convert the result to a DataFrame with DataFrame.reset_index, and finally create the list of dicts with DataFrame.to_dict:
print (df)
id title publish_up date
0 1 Exampl 2019-12-2 datetime
1 2 Exampl 2019-12-2 datetime
2 2 Exampl 2019-12-1 datetime
#if necessary change format 2019-12-1 to 2019-12-01
#df['publish_up'] = pd.to_datetime(df['publish_up'], format='%Y-%m-%d').dt.strftime('%Y-%m-%d')
print (df.groupby('publish_up')['id'].agg(list).reset_index())
publish_up id
0 2019-12-1 [2]
1 2019-12-2 [1, 2]
a = df.groupby('publish_up')['id'].agg(list).reset_index().to_dict('records')
print (a)
[{'publish_up': '2019-12-1', 'id': [2]}, {'publish_up': '2019-12-2', 'id': [1, 2]}]
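If you need the exact keys shown in the question ('date' and 'ids'), a sketch of renaming before building the dicts (assuming the same sample frame):
a = (df.groupby('publish_up')['id'].agg(list)
       .rename_axis('date')
       .reset_index()
       .rename(columns={'id': 'ids'})
       .to_dict('records'))
print (a)
[{'date': '2019-12-1', 'ids': [2]}, {'date': '2019-12-2', 'ids': [1, 2]}]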
