I have a Pandas DataFrame with the columns:
UserID, Date, (other columns that we can ignore here)
I'm trying to select only the users that have visited on multiple dates. I'm currently doing it with groupby(['UserID', 'Date']) and a for loop in which I drop users with only one result, but I feel like there is a much better way to do this.
Thanks
It depends on the exact format of output you want, but you can count distinct Dates inside each UserID and keep those where the count is > 1 (like HAVING COUNT(DISTINCT Date) > 1 in SQL):
>>> df
Date UserID
0 2013-01-01 00:00:00 1
1 2013-01-02 00:00:00 2
2 2013-01-02 00:00:00 2
3 2013-01-02 00:00:00 1
4 2013-01-02 00:00:00 3
>>> g = df.groupby('UserID').Date.nunique()
>>> g
UserID
1 2
2 1
3 1
>>> g > 1
UserID
1 True
2 False
3 False
dtype: bool
>>> g[g > 1]
UserID
1 2
As you can see, you get UserID = 1 as the result; it is the only user who visited on multiple dates.
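If you also want the matching rows of df rather than just the IDs, here is a sketch using transform on current pandas: transform('nunique') broadcasts each user's distinct-date count back onto the rows, so you can filter directly:
>>> df[df.groupby('UserID').Date.transform('nunique') > 1]
Date UserID
0 2013-01-01 00:00:00 1
3 2013-01-02 00:00:00 1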
To count the unique dates for every UserID:
df.groupby("UserID").Date.agg(lambda s: len(s.unique()))
Then you can drop users with only one count.
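For example, a minimal sketch of that drop step, mapping the counts back onto the rows:
counts = df.groupby("UserID").Date.agg(lambda s: len(s.unique()))
df[df["UserID"].map(counts) > 1]  # keeps only users seen on more than one date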
For the sake of adding another answer, you can also use indexing with list comprehension
DF = pd.DataFrame({'UserID': [1, 1, 2, 3, 4, 4, 5], 'Data': np.random.rand(7)})
DF.loc[[row for row in DF.index if list(DF.UserID).count(DF.UserID[row]) > 1]]
This might be as much work as your for loop, but it's just another option for you to consider...
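A related pandas-native sketch (assuming you want whole rows back, as above) is groupby().filter(), which keeps every group satisfying a predicate:
DF.groupby('UserID').filter(lambda g: len(g) > 1)  # users appearing more than once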
I have a database with a column describing the dates when certain data was collected. However, the dates were inserted as MM-DD (e.g., Jul-13) and they are coded as strings.
ids = pd.Series([1, 2, 3, 4])
dates = pd.Series(["Jul-29", "Jul-29", "Dec-29", "Apr-22"])
df = pd.DataFrame({"ids" : ids, "dates" : dates})
ids dates
0 1 Jul-29
1 2 Jul-29
2 3 Dec-29
3 4 Apr-22
I would like to insert the year in these dates before converting to date based on a condition. I know that data from December belongs to 2021, whereas the rest of the data was collected in 2022. Therefore I need something like this:
ids dates corrected_dates
0 1 Jul-29 Jul-29-2022
1 2 Jul-29 Jul-29-2022
2 3 Dec-29 Dec-29-2021
3 4 Apr-22 Apr-22-2022
I have tried:
df["corrected_dates"] = np.where("Dec" in df["dates"], df["dates"] + "-2021", df["dates"] + "-2022")
but this resulted in
ids dates corrected_dates
0 1 Jul-29 Jul-29-2022
1 2 Jul-29 Jul-29-2022
2 3 Dec-29 Dec-29-2022
3 4 Apr-22 Apr-22-2022
Therefore, I am probably not coding the conditional properly but I can't find out what I am doing wrong.
I was able to insert the year in a new column by doing
corrected_dates = []
for date in df["dates"]:
    if "Dec" in date:
        new_date = date + "-2021"
    else:
        new_date = date + "-2022"
    corrected_dates.append(new_date)
and then df["corrected_dates"] = corrected_dates, but this seems too cumbersome (not to mention that I am not sure this would work if there were missing data in df["dates"]).
Can anyone help me understand what I am doing wrong when using np.where() or suggest a better alternative than using a for loop?
Thanks
Let us do str.startswith. The problem with your attempt is that "Dec" in df["dates"] checks the Series' index for membership rather than each string, so the condition collapses to a single False and np.where takes the else branch for every row; str.startswith produces the elementwise mask you need:
df['new'] = np.where(df["dates"].str.startswith('Dec'), df["dates"] + "-2021", df["dates"] + "-2022")
df
Out[19]:
ids dates new
0 1 Jul-29 Jul-29-2022
1 2 Jul-29 Jul-29-2022
2 3 Dec-29 Dec-29-2021
3 4 Apr-22 Apr-22-2022
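If you then want actual datetimes, a sketch assuming the 'Jul-29-2022' style strings built above ('%b' matches abbreviated month names like 'Jul'):
df['new'] = pd.to_datetime(df['new'], format='%b-%d-%Y')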
We have the following Pandas Dataframe:
# Stackoverflow question
data = {'category':[1, 2, 3, 1, 2, 3, 1, 2, 3], 'date':['2000-01-01', '2000-01-01', '2000-01-01', '2000-01-02', '2000-01-02', '2000-01-02', '2000-01-03', '2000-01-03', '2000-01-03']}
df = pd.DataFrame(data=data)
df['date'] = pd.to_datetime(df['date'])
df
category date
0 1 2000-01-01
1 2 2000-01-01
2 3 2000-01-01
3 1 2000-01-02
4 2 2000-01-02
5 3 2000-01-02
6 1 2000-01-03
7 2 2000-01-03
8 3 2000-01-03
How can we query this dataframe to find the date 2000-01-02 with category 3? So we are looking for the row with index 5.
It should be accomplished without set_index('date').
The reason for this is as follows, when setting the index on the actual data rather than the example data I receive the following error:
InvalidIndexError: Reindexing only valid with uniquely valued Index objects
Take a subset of the relevant category, subtract the target date, and get the idxmin:
tmp = df.loc[df.category.eq(3)]
(tmp.date - pd.to_datetime("2000-01-02")).abs().idxmin()
# 5
To get the (first) closest index date with category 3 you could use:
m = df['category'].eq(3)
d = df['date'].sub(pd.Timestamp('2000-01-02')).abs()
d.loc[m].idxmin()
output: 5
For an exact match, plain boolean indexing also works:
df[(df['category']==3) & (df['date']==pd.Timestamp(2000,1,2))]
To get the list of all indices:
df.index[(df['category']==3) & (df['date']==pd.Timestamp(2000,1,2))].tolist()
I have a dataframe with more than 4 million rows and 30 columns, so I am just providing a sample of my patient dataframe:
df = pd.DataFrame({
    'subject_ID': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3],
    'date_visit': ['1/1/2020 12:35:21', '1/1/2020 14:35:32', '1/1/2020 16:21:20', '01/02/2020 15:12:37', '01/03/2020 16:32:12',
                   '1/1/2020 12:35:21', '1/3/2020 14:35:32', '1/8/2020 16:21:20', '01/09/2020 15:12:37', '01/10/2020 16:32:12',
                   '11/01/2022 13:02:31', '13/01/2023 17:12:31', '16/01/2023 19:22:31'],
    'item_name': ['PEEP', 'Fio2', 'PEEP', 'Fio2', 'PEEP', 'PEEP', 'PEEP', 'PEEP', 'PEEP', 'PEEP', 'Fio2', 'Fio2', 'Fio2']})
I would like to do two things
1) Find the subjects and their records which are missing in the sequence
2) Get the count of item_name for each subject
For q2, this is what I tried
df.groupby(['subject_ID','item_name']).count() # though this produces output, the column name is not okay. I mean, why does it show the count value in the `date_visit` column?
For q1, this is what I am trying
df['day'].le(df['shift_date'].add(1))
I expect my output to be as shown below.
You can get the counts (your second question) with:
In [14]: df.groupby("subject_ID")['item_name'].value_counts().unstack(fill_value=0)
Out[14]:
item_name Fio2 PEEP
subject_ID
1 2 3
2 0 5
3 3 0
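An equivalent one-liner for the same counts table, if you prefer, is pd.crosstab:
pd.crosstab(df['subject_ID'], df['item_name'])  # same Fio2/PEEP table as above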
EDIT:
I think you've still got your date formats a bit messed up in your sample output, and strongly recommend switching everything to the ISO 8601 standard since that prevents problems like that down the road. pandas won't correctly parse that 11/01/2022 entry on its own, so I've manually fixed it in the sample.
Using what I assume these dates are supposed to be, you can find the gaps by grouping and using .resample():
In [73]: df['dates'] = pd.to_datetime(df['date_visit'])
In [74]: df.loc[10, 'dates'] = pd.to_datetime("2022-01-11 13:02:31")
In [75]: dates = df.groupby("subject_ID").apply(lambda x: x.set_index('dates').resample('D').first())
In [76]: dates.index[dates.isnull().any(axis=1)].to_frame().reset_index(drop=True)
Out[76]:
subject_ID dates
0 2 2020-01-02
1 2 2020-01-04
2 2 2020-01-05
3 2 2020-01-06
4 2 2020-01-07
5 3 2022-01-12
6 3 2022-01-14
7 3 2022-01-15
You can then add a sequence-status flag to the first frame by checking whether each ID shows up in this new frame.
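For example, a sketch of that check (gaps and seq_ok are names I made up):
gaps = dates.index[dates.isnull().any(axis=1)].to_frame().reset_index(drop=True)
# a subject's daily sequence is complete only if it never appears in the gaps frame
df['seq_ok'] = ~df['subject_ID'].isin(gaps['subject_ID'])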
I have two problems; the first one follows now:
I have a dataframe df with many rows per userid, each with a date and some unimportant other columns:
userid date
0 243 2014-04-01
1 234 2014-12-01
2 234 2015-11-01
3 589 2016-07-01
4 589 2016-03-01
I am currently trying to group them by userid, sort the dates descending and cut out the twelve oldest. My code looks like this:
df = df.groupby(['userid'], group_keys=False).agg(lambda x: x.sort_values(['date'], ascending=False, inplace=False).head(12))
And I get this error:
ValueError: cannot copy sequence with size 6 to array axis with dimension 12
At the moment my aim is to avoid splitting the dataframe into individual ones.
My second problem is more complex:
I try to find out if the sorted dates (per group of userid) are monthly consecutive. This means that if there is a date for one userid group, for example userid: 234 and date: 2014-04-01, the next entry below must be userid: 234 and date: 2014-03-01. There is no focus on the day; only the year and month are important.
Only these twelve consecutive dates should be copied into another dataframe.
A second dataframe df2 contains the same userids, but they are unique, and there is another column, 'code'. Here is an example:
userid code
0 433805 1
24 5448 0
48 3434 1
72 34434 1
96 3202 1
120 23766 1
153 39457 0
168 4113 1
172 3435 5
374 34093 1
To summarize: I try to check if there are 12 consecutive months per userid and copy every correct sequence into another dataframe. For this I also have to compare the 'code' from df2.
This is a version of my code:
df['YearMonthDiff'] = df['date'].map(lambda x: 1000*x.year + x.month).diff()
df['id_before'] = df['userid'].shift()
final_df = pd.DataFrame()
for group in df.groupby(['userid'], group_keys=False):
    fi = group[1]
    if (fi['userid'] != fi['id_before']) & group['YearMonthDiff'].all(-1.0) & df.loc[fi.userid]['code'] != 5:
        final_df.append(group['userid', 'date', 'consum'])
First I calculated an integer from the date and took diff(). In other posts I saw that people shift the column to compare the values of the current row and the row before. Then I used groupby(userid) to iterate over the individual groups. Now it gets extra ugly: I tried to find the beginning of each userid group, check whether there are only consecutive months and the correct 'code', and finally append it to the final dataframe.
One of the biggest problems is comparing a row with the following row. I can iterate over them with iterrows(), but I cannot compare them without shift(). There is also a calendar function, but I will take a look at that on the weekend. Sorry for the mess, I am new to pandas.
Does anyone have an idea how to solve my problem?
For your first problem, try this:
df.groupby(by='userid').apply(lambda x: x.sort_values(by='date', ascending=False).iloc[[e for e in range(12) if e < len(x)]])
Using groupby and nlargest, we get the index values of those largest dates. Then we use .loc to get just those rows.
df.loc[df.groupby('userid').date.nlargest(12).index.get_level_values(1)]
Consider the dataframe df
dates = pd.date_range('2015-08-08', periods=10)
df = pd.DataFrame(dict(
    userid=np.arange(2).repeat(4),
    date=np.random.choice(dates, 8, False)
))
print(df)
date userid
0 2015-08-12 0 # <-- keep
1 2015-08-09 0
2 2015-08-11 0
3 2015-08-15 0 # <-- keep
4 2015-08-13 1
5 2015-08-10 1
6 2015-08-17 1 # <-- keep
7 2015-08-16 1 # <-- keep
We'll keep the latest 2 dates per user id
df.loc[df.groupby('userid').date.nlargest(2).index.get_level_values(1)]
date userid
0 2015-08-12 0
3 2015-08-15 0
6 2015-08-17 1
7 2015-08-16 1
Maybe someone is interested: I solved my second problem as follows.
I cast the date to an int, calculated the difference and shifted the userid one row, like in my example. Then follows this (I found a solution on Stack Overflow):
gr_ob = df.groupby('userid')
gr_dict = gr_ob.groups
final_df = pd.DataFrame(columns=['userid', 'date', 'consum'])
for group_name in gr_dict.keys():
    new_df = gr_ob.get_group(group_name)
    if (new_df['userid'].iloc[0] != new_df['id_before'].iloc[0]) & (new_df['YearMonthDiff'].iloc[1:] == -1.0).all() & (len(new_df) == 12):
        final_df = final_df.append(new_df[['userid', 'date', 'consum']])
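Side note: DataFrame.append was removed in pandas 2.0, so on current versions a sketch of the same loop would collect the pieces and concatenate once (the id_before check becomes redundant once you group by userid, so it is dropped here):
kept = []
for group_name, new_df in df.groupby('userid'):
    # mirror the check above: 12 rows whose YearMonthDiff values are all -1.0
    if len(new_df) == 12 and (new_df['YearMonthDiff'].iloc[1:] == -1.0).all():
        kept.append(new_df[['userid', 'date', 'consum']])
final_df = pd.concat(kept, ignore_index=True) if kept else pd.DataFrame(columns=['userid', 'date', 'consum'])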
>>> pd.version.short_version
'0.15.2'
>>> D = [{'time':pd.Timestamp('2000/01/01'), 'value':1},{'time':'----', 'value':3}]
>>> pd.DataFrame(D)
time value
0 2000-01-01 1
1 2015-03-03 3
The [1, 'time'] cell has been automatically filled in. It seems this happens only when the non-time string contains only characters like '-' or '/', which are typically used in datetimes.
>>> D = [{'time':'0', 'value':3}, {'time':pd.Timestamp('2000/01/01'), 'value':1}]
>>> pd.DataFrame(D)
time value
0 0 3
1 2000-01-01 00:00:00 1
Is this a bug or can I stop this?
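A workaround sketch, untested on 0.15.2 (treat it as an assumption): stringify the timestamps yourself so the column stays a plain object column and nothing gets auto-filled:
D = [{'time': str(pd.Timestamp('2000/01/01')), 'value': 1}, {'time': '----', 'value': 3}]
df = pd.DataFrame(D)  # 'time' is now an ordinary string column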