Pandas delete duplicate rows based on timestamp - python

I have a dataset with multiple duplicate records based on timestamps for the same date. For a given ID and date combination, I want to keep the record with the max timestamp and delete the other records.
Sample dataset
id|timestamp|value
--|---------|-----
1|2022-04-19T18:46:36.259+0000|xyz
1|2022-04-19T18:46:36.302+0000|xyz
1|2022-04-19T18:46:36.357+0000|xyz
1|2022-04-24T00:41:40.871+0000|xyz
1|2022-04-24T00:41:40.879+0000|xyz
1|2022-05-02T10:15:25.829+0000|xyz
1|2022-05-02T10:15:25.832+0000|xyz
Final Df
id|timestamp|value
--|---------|-----
1|2022-04-19T18:46:36.357+0000|xyz
1|2022-04-24T00:41:40.879+0000|xyz
1|2022-05-02T10:15:25.832+0000|xyz

If you add the data as code, it'll be easier to share the result. Since you already have the data, it's simpler to post it as code or text.
# To keep only the latest timestamp for each date:
# create a date-only column from the timestamp to identify the duplicates,
# sort values so the latest timestamp for an id comes last,
# drop duplicates based on id and date, keeping the last row,
# and finally drop the temp column.
(df.assign(d=pd.to_datetime(df['timestamp']).dt.date)
.sort_values(['id','timestamp'])
.drop_duplicates(subset=['id','d'], keep='last')
.drop(columns='d')
)
id timestamp value
2 1 2022-04-19T18:46:36.357+0000 xyz
4 1 2022-04-24T00:41:40.879+0000 xyz
6 1 2022-05-02T10:15:25.832+0000 xyz

A combination of .groupby and .max will do:
import pandas as pd
dates = pd.to_datetime(['01-01-1990', '01-02-1990', '01-02-1990', '01-03-1990'])
values = [1] * len(dates)
ids = values[:]
df = pd.DataFrame(zip(dates, values, ids), columns=['timestamp', 'val', 'id'])
selection = df.groupby(['val', 'id'])['timestamp'].max().reset_index()
print(selection)
output
val id timestamp
0 1 1 1990-01-03

You can use the following code for your task.
df.groupby(["id","value"]).max()
Explanation: first group by the id and value columns and then select only the maximum timestamp in each group.
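A further variant, sketched here under the assumption that the timestamp column parses cleanly with pd.to_datetime (the frame below is a hypothetical reconstruction of the question's sample data): group by the calendar date and keep the full row holding the max timestamp via idxmax.
import pandas as pd

# hypothetical frame mirroring the question's sample data
df = pd.DataFrame({
    'id': [1, 1, 1, 1, 1],
    'timestamp': ['2022-04-19T18:46:36.259+0000', '2022-04-19T18:46:36.357+0000',
                  '2022-04-24T00:41:40.871+0000', '2022-04-24T00:41:40.879+0000',
                  '2022-05-02T10:15:25.832+0000'],
    'value': ['xyz'] * 5})

ts = pd.to_datetime(df['timestamp'])                 # parse the ISO strings once
keep = ts.groupby([df['id'], ts.dt.date]).idxmax()   # index of the max timestamp per (id, date)
print(df.loc[keep].sort_index())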

Related

Calculate 3 months unique Emp count for a given month from last 3 months data using pandas

I am looking to calculate the count of unique employee IDs over the last 3 months using pandas. I am able to calculate the unique employee ID count for the current month, but I'm not sure how to do it for the last 3 months.
df['DateM'] = df['Date'].dt.to_period('M')
df.groupby("DateM")["EmpId"].nunique().reset_index().rename(columns={"EmpId":"One Month Unique EMP count"}).sort_values("DateM",ascending=False).reset_index(drop=True)
testdata.xlsx Google Drive link:
https://docs.google.com/spreadsheets/d/1Kaguf72YKIsY7rjYfctHop_OLIgOvIaS/edit?usp=sharing&ouid=117123134308310688832&rtpof=true&sd=true
After using the above groupby command I get output for 1-month groups based on the DateM column, which is correct.
Similarly, I'm looking for another column where the 3-month unique active user count based on EmpId is calculated.
Sample output:
I tried calculating the same thing using a rolling window, but it doesn't help. I also tried creating a period for the last 3 months, and I searched before asking this question. Thanks for your help in advance; otherwise I'll have to calculate it manually.
I don't know if you are looking for 3 consecutive months or something else, because your dates are discontinuous between 2022-09 and 2022-10.
I also don't know your purpose, so I give a general solution here: it produces the list of unique empid values for every window of 3 consecutive months. If you only want to count the uniques for every 3 consecutive months, it is much easier (see the sketch after the code below). Note that this means for 2022-08, I count the 3 consecutive months as 2022-08, 2022-09, and 2022-10, and so on.
# Sort data:
df.sort_values(by='datem', inplace=True, ignore_index=True)
# Create `dfu` which is `df` with unique `empid` for each `datem` only:
dfu = df.groupby(['datem', 'empid']).count().reset_index()
dfu.rename(columns={'date': 'count'}, inplace=True)
dfu.sort_values(by=['datem', 'empid'], inplace=True, ignore_index=True)
dfu
# Obtain the list of unique periods:
unique_period = dfu['datem'].unique()
# Create empty dataframe:
dfe = pd.DataFrame(columns=['datem', 'empid', 'start_period'])
for p in unique_period:
    # Create a range of 3 consecutive months:
    tem_range = pd.period_range(start=p, freq='M', periods=3)
    # Extract the rows of `dfu` whose period falls in the wanted range:
    tem_dfu = dfu.loc[dfu['datem'].isin(tem_range), :].copy()
    # Some cleaning: keep each empid once and drop the helper column
    tem_dfu = tem_dfu.drop_duplicates(subset='empid', keep='first')
    tem_dfu.drop(columns='count', inplace=True)
    tem_dfu['start_period'] = p
    # Concat to build the desired output:
    dfe = pd.concat([dfe, tem_dfu])
dfe
Hope this is what you are looking for.
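If only the count of unique EmpId values per 3-consecutive-month window is needed (the easier case mentioned above), a minimal sketch could look like the following; the datem and empid column names and the sample values are assumptions taken from this answer's code, not from the asker's real file:
import pandas as pd

# hypothetical data with a monthly period column, mirroring the answer's layout
df = pd.DataFrame({
    'datem': pd.PeriodIndex(['2022-07', '2022-07', '2022-08', '2022-09', '2022-10'], freq='M'),
    'empid': [1, 2, 2, 3, 1]})

counts = {}
for p in df['datem'].sort_values().unique():
    window = pd.period_range(start=p, freq='M', periods=3)           # p, p+1, p+2
    counts[p] = df.loc[df['datem'].isin(window), 'empid'].nunique()  # unique ids in the window

print(pd.Series(counts, name='3 Month Unique EMP count'))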

To get subset of dataframe based on index of a label

I have a dataframe from yahoo finance
import pandas as pd
import yfinance
ticker = yfinance.Ticker("INFY.NS")
df = ticker.history(period = '1y')
print(df)
This gives me df as,
If I specify
date = "2021-04-23"
I need a subset of df containing:
the row with index label "2021-04-23",
the rows of the 2 days before that date,
and the row of 1 day after that date.
The important thing here is that we cannot calculate "before" and "after" from the date strings, because df may not contain some dates; the rows should be selected based on index positions (i.e. the 2 rows at the previous index positions and the 1 row at the next index position).
For example, df has no "2021-04-21", but it does have "2021-04-20".
How can we implement this?
You can go for integer-based indexing. First find the integer location of the desired date and then take the desired subset with iloc:
import numpy as np

def get_subset(df, date):
    # get the integer positions of the matching date(s)
    matching_dates_inds, = np.nonzero(df.index == date)
    # and take the first one (works in case of duplicates)
    first_matching_date_ind = matching_dates_inds[0]
    # take the 4-row subset: 2 rows before, the date itself, 1 row after
    desired_subset = df.iloc[first_matching_date_ind - 2: first_matching_date_ind + 2]
    return desired_subset
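A usage example of get_subset on a small, made-up frame (the index dates and prices here are assumptions, not the asker's real yfinance data):
import pandas as pd

df = pd.DataFrame({'Close': [10.0, 10.5, 11.0, 10.8, 11.2]},
                  index=pd.to_datetime(['2021-04-19', '2021-04-20', '2021-04-22',
                                        '2021-04-23', '2021-04-26']))
print(get_subset(df, "2021-04-23"))
# returns the rows for 2021-04-20, 2021-04-22, 2021-04-23 and 2021-04-26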
If you need the before and after values by position (assuming the date always exists in the DatetimeIndex), use DataFrame.iloc with the position obtained from Index.get_loc, wrapped in max and min so the selection does not fail when fewer than 2 values exist before or no value exists after, as in this sample data:
df = pd.DataFrame({'a': [1, 2, 3]},
                  index=pd.to_datetime(['2021-04-21', '2021-04-23', '2021-04-25']))
date = "2021-04-23"
pos = df.index.get_loc(date)
df = df.iloc[max(0, pos-2):min(len(df), pos+2)]
print(df)
a
2021-04-21 1
2021-04-23 2
2021-04-25 3
Notice: min and max are added so the selection does not fail when the date is the first index value (no values exist before it), the second (only one value exists before it), or the last (no value exists after it).
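For example, re-running the same lookup on the first date of that toy frame shows the guard at work (a small self-contained sketch, not part of the original answer):
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]},
                  index=pd.to_datetime(['2021-04-21', '2021-04-23', '2021-04-25']))
pos = df.index.get_loc("2021-04-21")            # first row, so pos == 0
print(df.iloc[max(0, pos-2):min(len(df), pos+2)])
# only the rows for 2021-04-21 and 2021-04-23 are returned,
# since nothing exists before the first index position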

Find records for repeated ids in the following days based on the first day's value using python

I'm trying to find the repeated ids based on the first day's value.
For example, I have records for 4 days:
import pandas as pd
df = pd.DataFrame({'id':    ['1','2','5','4','2','3','5','4','2','5','2','3','3','4'],
                   'class': ['1','1','0','0','1','1','1','1','0','0','0','0','1','1'],
                   'day':   ['1','1','1','1','1','1','1','2','2','3','3','3','4','4']})
df
Given the above data, I'd like to find the records that fit the following conditions:
(1) all the records on day=1 that have class = 0;
(2) on days 2, 3, and 4, keep the records whose id satisfies condition (1), i.e. had class=0 on day 1.
So the results should be:
df = pd.DataFrame({'id':    ['5','4','4','5','4'],
                   'class': ['0','0','1','0','1'],
                   'day':   ['1','1','2','3','4']})
df
This method would work:
# 1. find the unique ids on day 1 that meet condition (1)
df1 = df[(df['day']=='1') & (df['class']=='0')]
df1_id = df1.id.unique()
# 2. create a new dataframe for days 2, 3, 4
df234 = df[df['day']!='1']
# 3. keep only the rows of days 2, 3, 4 whose id is in the unique list
df234_new = df234[df234['id'].isin(df1_id)]
# 4. append df234_new to the end of df1
df_new = df1.append(df234_new)
df_new
But my full dataset contains way more columns and rows, and the above method seems too tedious. Does anyone know how to do it more efficiently? Thank you very much!!
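One more concise way to express the same logic, sketched as a suggestion (not an accepted answer) and reusing the df and column names from the question above:
# ids that have class == '0' on day 1
day1_ids = df.loc[(df['day'] == '1') & (df['class'] == '0'), 'id'].unique()

# keep the day-1 rows with class == '0', plus any later-day rows whose id is in that set
mask = ((df['day'] == '1') & (df['class'] == '0')) | ((df['day'] != '1') & df['id'].isin(day1_ids))
df_new = df[mask]
print(df_new)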

Count each observation a row

I have a pandas df named df, with millions of observations (rows) and only 4 columns.
I'm trying to convert the event_type column into several columns, and add a count to each row for that column.
My df looks like this:
event_type event_time organization_id user_id
0 Applied Saved View 2018-11-22 10:59:57.360 3 0
And I'm looking for this:
Applied_Saved_View event_time organization_id user_id
0 1 2018-11-22 10:59:57.360 3 0
I believe you are looking for something called pd.get_dummies. I assume you are trying to turn this into indicator (categorical) columns? I have no way of testing without sample data, but see the code below.
df2 = pd.get_dummies(df['event_type'])
new_df = pd.concat([df2,df],axis=1)
I should mention that you should check how many unique values there are in this event_type column, because each of them will become its own column, whether that's 10 or 100000 unique values.
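A small demonstration on made-up data shaped like the question's sample row (the second event and its values are assumptions added for illustration):
import pandas as pd

df = pd.DataFrame({'event_type': ['Applied Saved View', 'Logged In'],
                   'event_time': ['2018-11-22 10:59:57.360', '2018-11-22 11:03:12.000'],
                   'organization_id': [3, 3],
                   'user_id': [0, 1]})

df2 = pd.get_dummies(df['event_type'])
new_df = pd.concat([df2, df], axis=1)
print(new_df)
# each unique event_type becomes its own indicator column; drop the original
# event_type column afterwards if it is no longer needed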

Search in pandas dataframe

Potentially a slightly misleading title but the problem is this:
I have a large dataframe with multiple columns. This looks a bit like
df =
id date value
A 01-01-2015 1.0
A 03-01-2015 1.2
...
B 01-01-2015 0.8
B 02-01-2015 0.8
...
What I want to do is within each of the IDs I identify the date one week earlier and place the value on this date into e.g. a 'lagvalue' column. The problem comes with not all dates existing for all ids so a simple .shift(7) won't pull the correct value [in this instance I guess I should put a NaN in].
I can do this with a lot of horrible iterating over the dates and ids to find the value, for example this rough idea:
[
df[
df['date'] == df['date'].iloc[i] - datetime.timedelta(weeks=1)
][
df['id'] == df['id'].iloc[i]
]['value']
for i in range(len(df.index))
]
but I'm certain there is a 'better' way to do it that cuts down on time and processing, which I just can't think of right now.
I could write a function using a groupby on the id and then look within each group, and I'm certain that would reduce the time the operation takes. Is there a much quicker, simpler way [aka am I having a dim day]?
Basic strategy is, for each id, to:
Use date index
Use reindex to expand the data to include all dates
Use shift to shift 7 spots
Use ffill to do last-value interpolation. I'm not sure if you want this, or possibly bfill, which will use the last value less than a week in the past. But it's simple to change. Alternatively, if you want NaN when no value is available 7 days in the past, you can just remove the *fill completely.
Drop unneeded data
This algorithm gives NaN when the lag is too far in the past.
There are a few assumptions here. In particular that the dates are unique inside each id and they are sorted. If not sorted, then use sort_values to sort by id and date. If there are duplicate dates, then some rules will be needed to resolve which values to use.
import pandas as pd
import numpy as np

# two ids with different date spacings (every 3rd and every 5th day)
dates = pd.date_range('2001-01-01', periods=100)
dates = dates[::3]
A = pd.DataFrame({'date': dates,
                  'id': ['A'] * len(dates),
                  'value': np.random.randn(len(dates))})

dates = pd.date_range('2001-01-01', periods=100)
dates = dates[::5]
B = pd.DataFrame({'date': dates,
                  'id': ['B'] * len(dates),
                  'value': np.random.randn(len(dates))})

df = pd.concat([A, B])

with_lags = []
for id, group in df.groupby('id'):
    group = group.set_index(group.date)
    index = group.index
    # expand to a daily index so that shift(7) really means "7 days ago"
    group = group.reindex(pd.date_range(group.index[0], group.index[-1]))
    group = group.ffill()
    group['lag_value'] = group.value.shift(7)
    # restrict back to the original dates
    group = group.loc[index]
    with_lags.append(group)

with_lags = pd.concat(with_lags, axis=0)
with_lags.index = np.arange(with_lags.shape[0])
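An alternative that avoids the reindex loop is a self-merge on (id, date shifted forward by one week). This is a sketch of a different technique than the answer above, reusing the df built in that example; unlike the ffill version, it yields NaN whenever no row exists exactly one week earlier, which seems to be what the question asked for.
# shift each row's date forward by 7 days so its value lines up as the lag of the later date
lag = (df[['id', 'date', 'value']]
       .rename(columns={'value': 'lag_value'})
       .assign(date=lambda d: d['date'] + pd.Timedelta(weeks=1)))

with_lags2 = df.merge(lag, on=['id', 'date'], how='left')
print(with_lags2.head(10))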
