Pandas: using groupby and nunique taking time into account - python

I have a dataframe in this form:
A  B  time
1  2  2019-01-03
1  3  2018-04-05
1  4  2020-01-01
1  4  2020-02-02
where A and B contain some integer identifiers.
I want to measure the number of different identifiers each A has interacted with. To do this I usually simply do
df.groupby('A')['B'].nunique()
I now have to do a slightly different thing: each A identifier has its own date assigned, which splits its interactions into two parts: the ones happening before that date and the ones happening after it. The same operation as before (counting the number of unique B values interacted with) needs to be done for both parts separately.
For example, if the date for A=1 was 2018-07-01, the output would be
A  before  after
1       1      2
In the real data, A contains millions of different identifiers, each with its unique date assigned.
EDITED
To be clearer, I added a line to df. I want to count the number of different values of B each A interacts with, before and after the date.

I would map A to its cutoff date (a dict, here date_dict), compare those dates with df['time'], and then groupby().value_counts():
(df['A'].map(date_dict)                                  # each row's cutoff date for its A
   .gt(df['time'])                                       # True -> interaction happened before the cutoff
   .groupby(df['A'])
   .value_counts()
   .unstack()
   .rename({False:'after', True:'before'}, axis=1)
)
Output:
   after  before
A
1      2       1
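Since the goal is the number of distinct B values on each side (rather than the number of rows), here is a minimal alternative sketch along the same lines, assuming the same date_dict mapping each A to its cutoff date; the toy frame and cutoff below just reconstruct the question's example:

import pandas as pd

# sample data from the question, plus the cutoff for A=1 (assumed dict name: date_dict)
df = pd.DataFrame({'A': [1, 1, 1, 1],
                   'B': [2, 3, 4, 4],
                   'time': pd.to_datetime(['2019-01-03', '2018-04-05',
                                           '2020-01-01', '2020-02-02'])})
date_dict = {1: pd.Timestamp('2018-07-01')}

# label each row as 'before' or 'after' its A's cutoff date
side = (df['A'].map(date_dict).gt(df['time'])
          .map({True: 'before', False: 'after'})
          .rename('side'))

# count distinct B values per A on each side of the cutoff
out = df.groupby(['A', side])['B'].nunique().unstack(fill_value=0)
print(out)   # one row per A, with 'after' and 'before' counts of unique B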

Related

find max-min values for one column based on another

I have a dataset that looks like this.
datetime                     id
2020-01-22 11:57:09.286 UTC   5
2020-01-22 11:57:02.303 UTC   6
2020-01-22 11:59:02.303 UTC   5
Ids are not unique and appear with different datetime values. Let's say:
duration = max(datetime)-min(datetime).
I want to count the ids for which the duration max(datetime)-min(datetime) is less than 2 seconds. For this example, the output would be:
count = 1
because of id 5. Then, I want to create a new dataset which contains only those rows with the min(datetime) value for each of the unique ids. So, the new dataset will contain the first row but not the third. The final data set should not have any duplicate ids.
datetime                     id
2020-01-22 11:57:09.286 UTC   5
2020-01-22 11:57:02.303 UTC   6
How can I do any of these?
P.S.: The dataset I provided might not be a good example, since the condition is 2 seconds but here the differences are in minutes.
Do you want this? :
from datetime import timedelta

df.datetime = pd.to_datetime(df.datetime)

c = 0
def count(x):
    global c
    x = x.sort_values('datetime')
    # only groups with more than one row have a meaningful duration
    if len(x) > 1:
        diff = x.iloc[-1]['datetime'] - x.iloc[0]['datetime']
        if diff < timedelta(seconds=2):
            c += 1
    return x.head(1)  # keep the row with the earliest datetime

new_df = df.groupby('id').apply(count).reset_index(drop=True)
Now, if you print c it will show the count, which is 1 for this case, and new_df will hold the final dataframe.
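A vectorized sketch of the same idea, without the global counter; grp and span are just illustrative names, and the len > 1 check from the answer above is kept:

from datetime import timedelta
import pandas as pd

df['datetime'] = pd.to_datetime(df['datetime'])

grp = df.groupby('id')['datetime']
span = grp.max() - grp.min()                  # duration per id
sizes = grp.size()                            # number of rows per id
count = ((sizes > 1) & (span < timedelta(seconds=2))).sum()

# one row per id, keeping its earliest datetime
new_df = df.sort_values('datetime').drop_duplicates('id', keep='first')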

Python pandas: Get first values of group

I have a list of recorded diagnoses like this:
df = pd.DataFrame({
    "DiagnosisTime": ["2017-01-01 08:23:00", "2017-01-01 08:23:00",
                      "2017-01-01 08:23:03", "2017-01-01 08:27:00",
                      "2019-12-31 20:19:39", "2019-12-31 20:19:39"],
    "ID": [1, 1, 1, 1, 2, 2]
})
There are multiple subjects that can be identified by an ID. For each subject there may be one or more diagnoses. Each diagnosis may be comprised of multiple entries (as multiple things are recorded (not in this example)).
The individual diagnoses (with multiple rows) can (to some extent) be identified by the DiagnosisTime. However, sometimes there is a little delay during the writing of data for one diagnosis, so I want to allow a small tolerance of a few seconds when grouping by DiagnosisTime.
In this example I want a result as follows:
There are two diagnoses for ID 1: rows 0, 1, 2 and row 3. Note the slightly different DiagnosisTime in row 2 compared to rows 0 and 1. ID 2 has one diagnosis, comprising rows 4 and 5.
For each ID I want the counter to start back at 1 (or 0 if that's easier).
This is how far I've come:
df["DiagnosisTime"] = pd.to_datetime(df["DiagnosisTime"])
df["diagnosis_number"] = df.groupby([pd.Grouper(freq='5S', key="DiagnosisTime"), 'ID']).ngroup()
I think I successfully identified diagnoses within one ID (not entirely sure about the Grouper), but I don't know how to reset the counter.
If that is not possible I would also be satisfied with a function which returns all records of one ID that have the lowest diagnosis_number within that group.
You can use a lambda function with GroupBy.transform and factorize:
df["diagnosis_number"] = (df.groupby('ID')['diagnosis_number']
.transform(lambda x: pd.factorize(x)[0]) + 1)
print (df)
        DiagnosisTime  ID  diagnosis_number
0 2017-01-01 08:23:00   1                 1
1 2017-01-01 08:23:00   1                 1
2 2017-01-01 08:23:03   1                 1
3 2017-01-01 08:27:00   1                 2
4 2019-12-31 20:19:39   2                 1
5 2019-12-31 20:19:39   2                 1
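For reference, a self-contained sketch that combines the 5-second Grouper step from the question with the transform above; the 5-second tolerance is the question's own assumption:

import pandas as pd

df = pd.DataFrame({
    "DiagnosisTime": ["2017-01-01 08:23:00", "2017-01-01 08:23:00",
                      "2017-01-01 08:23:03", "2017-01-01 08:27:00",
                      "2019-12-31 20:19:39", "2019-12-31 20:19:39"],
    "ID": [1, 1, 1, 1, 2, 2],
})
df["DiagnosisTime"] = pd.to_datetime(df["DiagnosisTime"])

# global label per (5-second bin, ID), then restart the counter within each ID
df["diagnosis_number"] = df.groupby(
    [pd.Grouper(freq="5S", key="DiagnosisTime"), "ID"]).ngroup()
df["diagnosis_number"] = (df.groupby("ID")["diagnosis_number"]
                            .transform(lambda x: pd.factorize(x)[0]) + 1)
print(df)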

How to find missing date rows in a sequence using pandas?

I have a dataframe with more than 4 million rows and 30 columns. Here is a sample of my patient dataframe:
df = pd.DataFrame({
    'subject_ID': [1,1,1,1,1,2,2,2,2,2,3,3,3],
    'date_visit': ['1/1/2020 12:35:21','1/1/2020 14:35:32','1/1/2020 16:21:20',
                   '01/02/2020 15:12:37','01/03/2020 16:32:12',
                   '1/1/2020 12:35:21','1/3/2020 14:35:32','1/8/2020 16:21:20',
                   '01/09/2020 15:12:37','01/10/2020 16:32:12',
                   '11/01/2022 13:02:31','13/01/2023 17:12:31','16/01/2023 19:22:31'],
    'item_name': ['PEEP','Fio2','PEEP','Fio2','PEEP','PEEP','PEEP','PEEP','PEEP',
                  'PEEP','Fio2','Fio2','Fio2']})
I would like to do two things:
1) Find the subjects and their records that are missing in the sequence
2) Get the count of item_name for each subject
For q2, this is what I tried:
df.groupby(['subject_ID','item_name']).count() # this produces output, but why does it show the count under the `date_visit` column?
For q1, this is what I am trying:
df['day'].le(df['shift_date'].add(1))
I expect my output to be as shown below.
You can get the item_name counts with:
In [14]: df.groupby("subject_ID")['item_name'].value_counts().unstack(fill_value=0)
Out[14]:
item_name   Fio2  PEEP
subject_ID
1              2     3
2              0     5
3              3     0
EDIT:
I think you've still got your date formats a bit messed up in your sample output, and strongly recommend switching everything to the ISO 8601 standard since that prevents problems like that down the road. pandas won't correctly parse that 11/01/2022 entry on its own, so I've manually fixed it in the sample.
Using what I assume these dates are supposed to be, you can find the gaps by grouping and using .resample():
In [73]: df['dates'] = pd.to_datetime(df['date_visit'])
In [74]: df.loc[10, 'dates'] = pd.to_datetime("2022-01-11 13:02:31")
In [75]: dates = df.groupby("subject_ID").apply(lambda x: x.set_index('dates').resample('D').first())
In [76]: dates.index[dates.isnull().any(axis=1)].to_frame().reset_index(drop=True)
Out[76]:
   subject_ID      dates
0           2 2020-01-02
1           2 2020-01-04
2           2 2020-01-05
3           2 2020-01-06
4           2 2020-01-07
5           3 2022-01-12
6           3 2022-01-14
7           3 2022-01-15
You can then add a sequence-status flag to the first frame by checking whether each subject_ID shows up in this new frame.
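For example, a small sketch building on the frames above; the seq_missing column name is an assumption:

# subjects/dates that are missing in the sequence, as computed above
gaps = dates.index[dates.isnull().any(axis=1)].to_frame().reset_index(drop=True)

# flag rows whose subject has at least one missing day
df['seq_missing'] = df['subject_ID'].isin(gaps['subject_ID'])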

How to merge by a column of collection using Python Pandas?

I have 2 lists of Stack Overflow questions, group A and group B. Both have two columns, Id and Tag. e.g:
| Id | Tag                                         |
|----|---------------------------------------------|
| 2  | c#,winforms,type-conversion,decimal,opacity |
For each question in group A, I need to find in group B all questions that have at least one tag in common with the group A question, independent of the position of the tags. For example, these questions should all be matched questions:
| Id | Tag                      |
|----|--------------------------|
| 3  | c#                       |
| 4  | winforms,type-conversion |
| 5  | winforms,c#              |
My first thought was to convert the Tag variable into a set and merge using Pandas, because sets ignore position. However, it seems that Pandas doesn't allow a set to be the key variable. So I am now using a for loop to search over group B, but it is extremely slow since I have 13 million observations in group B.
My question is:
1. Is there any other way in Python to merge on a column of collections that can also tell me the number of overlapping tags?
2. How can I improve the efficiency of the for-loop search?
This can be achieved using df.merge and df.groupby.
This is the setup I'm working with:
df1 = pd.DataFrame({ 'Id' : [2], 'Tag' : [['c#', 'winforms', 'type-conversion', 'decimal', 'opacity']]})
Id Tag
0 2 [c#, winforms, type-conversion, decimal, opacity]
df2 = pd.DataFrame({ 'Id' : [3, 4, 5], 'Tag' : [['c#'], ['winforms', 'type-conversion'], ['winforms', 'c#']]})
Id Tag
0 3 [c#]
1 4 [winforms, type-conversion]
2 5 [winforms, c#]
Let's flatten out the Tag column in both data frames:
In [2331]: import numpy as np
      ...: from itertools import chain

In [2332]: def flatten(df):
      ...:     return pd.DataFrame({"Id": np.repeat(df.Id.values, df.Tag.str.len()),
      ...:                          "Tag": list(chain.from_iterable(df.Tag))})
      ...:
In [2333]: df1 = flatten(df1)
In [2334]: df2 = flatten(df2)
In [2335]: df1.head()
Out[2335]:
Id Tag
0 2 c#
1 2 winforms
2 2 type-conversion
3 2 decimal
4 2 opacity
And similarly for df2, which is also flattened.
Now for the magic. We'll merge on the Tag column, and then group by the joined Ids to find the count of overlapping tags.
In [2337]: df1.merge(df2, on='Tag').groupby(['Id_x', 'Id_y']).count().reset_index()
Out[2337]:
Id_x Id_y Tag
0 2 3 1
1 2 4 2
2 2 5 2
The output shows each pair of Ids along with the number of overlapping tags; pairs with no overlap never appear, because the inner merge only keeps rows that share a tag.
The count call counts the overlapping tags, and reset_index just prettifies the output, since groupby assigns the grouped columns as the index.
To see matching tags, you'll modify the above slightly:
In [2359]: df1.merge(df2, on='Tag').groupby(['Id_x', 'Id_y'])['Tag'].apply(list).reset_index()
Out[2359]:
Id_x Id_y Tag
0 2 3 [c#]
1 2 4 [winforms, type-conversion]
2 2 5 [c#, winforms]
To filter out pairs with only one overlapping tag, chain a df.query call to the first expression:
In [2367]: df1.merge(df2, on='Tag').groupby(['Id_x', 'Id_y']).count().reset_index().query('Tag > 1')
Out[2367]:
Id_x Id_y Tag
1 2 4 2
2 2 5 2
Step 1: List all the tags.
Step 2: Create a binary representation for each question, i.e. use a bit (1 or 0) to indicate whether it has a given tag.
Step 3: To find the Ids that share a tag, call a simple apply function to decode the binary representation.
In terms of processing speed it should be all right, but if the number of tags is very large there might be memory issues. If you only need to find questions sharing a tag with a single Id, I would suggest writing a simple function and calling df.apply. If you need to check many Ids and find questions with shared tags, the approach above will work better.
(Intended to leave it as comment, but not enough reputation... sigh)
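A rough sketch of this binary-representation idea, using the original list-valued df1/df2 from the setup above (before flattening); the variable names a, b, overlap are just illustrative:

import pandas as pd

# one indicator column per tag, indexed by Id
a = df1.set_index('Id')['Tag'].str.join('|').str.get_dummies()
b = df2.set_index('Id')['Tag'].str.join('|').str.get_dummies()

# align both frames on the union of tags so the dot product is well defined
a, b = a.align(b, join='outer', axis=1, fill_value=0)

overlap = a.dot(b.T)      # rows: Ids in group A, columns: Ids in group B, values: shared-tag counts
matches = overlap.gt(0)   # True where at least one tag is shared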

Comparing values with groups - pandas

First things first, I have a data frame that has these columns:
issue_date | issue | special | group
Multiple rows can belong to the same group. For each group, I want to get its maximum date:
date_current = history.groupby('group').agg({'issue_date' : [np.min, np.max]})
date_current = date_current.issue_date.amax
After that, I want to filter each group by its max date minus n months:
date_before = date_current.values - pd.Timedelta(weeks=4*n)
I.e., for each group, I want to discard rows where the column issue_date < date_before:
hh = history[history['issue_date'] > date_before]
ValueError: Lengths must match to compare
This last line doesn't work though, since the lengths don't match. This is expected because I have x lines in my data frame, but date_before has a length equal to the number of groups in my data frame.
Given this data, I'm wondering how I can perform this subtraction, or filtering, by group. Do I have to iterate over the data frame somehow?
You can solve this in a similar manner to what you attempted.
I've created my own sample data as follows:
history
  issue_date  group
0 2014-01-02      1
1 2014-01-02      2
2 2016-02-04      3
3 2016-03-05      2
You can use groupby and apply to do what you were attempting. First you define the function you want to apply; then groupby.apply will apply it to every group. In this case I've used n=1 to demonstrate the point:
def date_compare(df):
    # latest issue_date within this group
    date_current = df.issue_date.max()
    date_before = date_current - pd.Timedelta(weeks=4*1)
    # keep only the rows newer than the cutoff
    hh = df[df['issue_date'] > date_before]
    return hh

hh = history.groupby('group').apply(date_compare)
        issue_date  group
group
1     0 2014-01-02      1
2     3 2016-03-05      2
3     2 2016-02-04      3
So the earlier date in group 2 has not survived the cut.
Hope that's helpful and that it follows the same logic you were going for.
I think your best option will be to merge your original df with date_current, but this will only work if you change your calculation of the date_before such that the group information isn't lost:
date_before = date_current - pd.Timedelta(weeks=4*n)
Then you can merge, left on group and right on the index (since you grouped on that before):
history = pd.merge(history, date_before.to_frame(), left_on='group', right_index=True)
Then your filter should work. The call of to_frame is nessesary because you can't merge a dataframe and a series.
Hope that helps.
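Putting that together, a minimal sketch of this merge approach, assuming issue_date is already a datetime and n is known; the date_before column name is my own choice:

import pandas as pd

n = 1
date_current = history.groupby('group')['issue_date'].max()      # max date per group
date_before = date_current - pd.Timedelta(weeks=4 * n)           # cutoff per group, indexed by group

# attach each row's cutoff, then filter
history = pd.merge(history, date_before.to_frame('date_before'),
                   left_on='group', right_index=True)
hh = history[history['issue_date'] > history['date_before']]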
