Selecting rows based on last occurrence of a string in pandas - python

I have a pandas dataframe which looks something like this,
id desc
1 Description
1 02.09.2017 15:00 abcd
1 this is a sample description
1 which is continued here also
1
1 Description
1 01.09.2017 12:00 absd
1 this is another sample description
1 which might be continued here
1 or here
1
2 Description
2 09.03.2017 12:00 abcd
2 another sample again
2 and again
2
2 Description
2 08.03.2017 12:00 abcd
2 another sample again
2 and again times two
Basically, there is an id and the rows contain the information in a very unstructured format. For each id I want to extract the description that comes after the last "Description" row and store it in a single row. The resulting dataframe would look something like this:
id desc
1 this is another sample description which might be continued here or here
2 another sample again and again times two
From what I can tell, I might have to use groupby, but I don't know what to do after that.
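For reference, here is a minimal reconstruction of the sample frame (the blank rows are assumed to be missing values):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [1] * 11 + [2] * 9,
    # blank desc rows are represented as NaN (an assumption about the original data)
    'desc': ['Description', '02.09.2017 15:00 abcd',
             'this is a sample description', 'which is continued here also', np.nan,
             'Description', '01.09.2017 12:00 absd',
             'this is another sample description', 'which might be continued here',
             'or here', np.nan,
             'Description', '09.03.2017 12:00 abcd',
             'another sample again', 'and again', np.nan,
             'Description', '08.03.2017 12:00 abcd',
             'another sample again', 'and again times two'],
})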

Extract the position of the last "Description" row within each group and join the rows after it using str.cat:
In [2840]: def lastjoin(x):
      ...:     pos = x.desc.eq('Description').cumsum().idxmax()
      ...:     return x.desc.loc[pos+2:].str.cat(sep=' ')
      ...:
In [2841]: df.groupby('id').apply(lastjoin)
Out[2841]:
id
1 this is another sample description which might...
2 another sample again and again times two
dtype: object
To get id and desc back as columns, use reset_index:
In [3216]: df.groupby('id').apply(lastjoin).reset_index(name='desc')
Out[3216]:
id desc
0 1 this is another sample description which might...
1 2 another sample again and again times two

Related

Python pandas: Get first values of group

I have a list of recorded diagnoses like this:
df = pd.DataFrame({
    "DiagnosisTime": ["2017-01-01 08:23:00", "2017-01-01 08:23:00", "2017-01-01 08:23:03",
                      "2017-01-01 08:27:00", "2019-12-31 20:19:39", "2019-12-31 20:19:39"],
    "ID": [1, 1, 1, 1, 2, 2]
})
There are multiple subjects that can be identified by an ID. For each subject there may be one or more diagnoses. Each diagnosis may be made up of multiple entries (as multiple things are recorded; not in this example).
The individual diagnoses (with multiple rows) can (to some extent) be identified by the DiagnosisTime. However, sometimes there is a small delay while the data for one diagnosis is written, so I want to allow a tolerance of a few seconds when grouping by DiagnosisTime.
In this example I want a result as follows:
There are two diagnoses for ID 1: rows 0, 1, 2 and row 3. Note the slightly different DiagnosisTime in row 2 compared to rows 0 and 1. ID 2 has one diagnosis, comprising rows 4 and 5.
For each ID I want the counter to restart at 1 (or 0 if that's easier).
This is how far I've come:
df["DiagnosisTime"] = pd.to_datetime(df["DiagnosisTime"])
df["diagnosis_number"] = df.groupby([pd.Grouper(freq='5S', key="DiagnosisTime"), 'ID']).ngroup()
I think I successfully identified diagnoses within one ID (not entirely sure about the Grouper), but I don't know how to reset the counter.
If that is not possible I would also be satisfied with a function which returns all records of one ID that have the lowest diagnosis_number within that group.
You can use a lambda function with GroupBy.transform and factorize:
df["diagnosis_number"] = (df.groupby('ID')['diagnosis_number']
.transform(lambda x: pd.factorize(x)[0]) + 1)
print (df)
DiagnosisTime ID diagnosis_number
0 2017-01-01 08:23:00 1 1
1 2017-01-01 08:23:00 1 1
2 2017-01-01 08:23:03 1 1
3 2017-01-01 08:27:00 1 2
4 2019-12-31 20:19:39 2 1
5 2019-12-31 20:19:39 2 1
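Putting this together with the Grouper step from the question, a minimal end-to-end sketch (using the sample frame from the question) would be:
import pandas as pd

df = pd.DataFrame({
    "DiagnosisTime": ["2017-01-01 08:23:00", "2017-01-01 08:23:00", "2017-01-01 08:23:03",
                      "2017-01-01 08:27:00", "2019-12-31 20:19:39", "2019-12-31 20:19:39"],
    "ID": [1, 1, 1, 1, 2, 2]
})
df["DiagnosisTime"] = pd.to_datetime(df["DiagnosisTime"])

# bin rows into 5-second windows per ID, as in the question ...
df["diagnosis_number"] = df.groupby(
    [pd.Grouper(freq='5S', key="DiagnosisTime"), 'ID']
).ngroup()

# ... then restart the numbering within each ID
df["diagnosis_number"] = (df.groupby('ID')['diagnosis_number']
                            .transform(lambda x: pd.factorize(x)[0]) + 1)
print(df)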

How to find missing date rows in a sequence using pandas?

I have a dataframe with more than 4 million rows and 30 columns. I am just providing a sample of my patient dataframe:
df = pd.DataFrame({
    'subject_ID': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3],
    'date_visit': ['1/1/2020 12:35:21', '1/1/2020 14:35:32', '1/1/2020 16:21:20',
                   '01/02/2020 15:12:37', '01/03/2020 16:32:12',
                   '1/1/2020 12:35:21', '1/3/2020 14:35:32', '1/8/2020 16:21:20',
                   '01/09/2020 15:12:37', '01/10/2020 16:32:12',
                   '11/01/2022 13:02:31', '13/01/2023 17:12:31', '16/01/2023 19:22:31'],
    'item_name': ['PEEP', 'Fio2', 'PEEP', 'Fio2', 'PEEP', 'PEEP', 'PEEP',
                  'PEEP', 'PEEP', 'PEEP', 'Fio2', 'Fio2', 'Fio2']
})
I would like to do two things:
1) Find the subjects and their records which are missing in the sequence
2) Get the count of item_name for each subject
For q2, this is what I tried:
df.groupby(['subject_ID','item_name']).count() # though this produces output, the column name is not okay. I mean, why does it show the count value in the `date_visit` column?
For q1, this is what I am trying
df['day'].le(df['shift_date'].add(1))
I expect my output to be as shown below.
You can get the first part with:
In [14]: df.groupby("subject_ID")['item_name'].value_counts().unstack(fill_value=0)
Out[14]:
item_name Fio2 PEEP
subject_ID
1 2 3
2 0 5
3 3 0
EDIT:
I think you've still got your date formats a bit messed up in your sample output, and strongly recommend switching everything to the ISO 8601 standard since that prevents problems like that down the road. pandas won't correctly parse that 11/01/2022 entry on its own, so I've manually fixed it in the sample.
Using what I assume these dates are supposed to be, you can find the gaps by grouping and using .resample():
In [73]: df['dates'] = pd.to_datetime(df['date_visit'])
In [74]: df.loc[10, 'dates'] = pd.to_datetime("2022-01-11 13:02:31")
In [75]: dates = df.groupby("subject_ID").apply(lambda x: x.set_index('dates').resample('D').first())
In [76]: dates.index[dates.isnull().any(axis=1)].to_frame().reset_index(drop=True)
Out[76]:
subject_ID dates
0 2 2020-01-02
1 2 2020-01-04
2 2 2020-01-05
3 2 2020-01-06
4 2 2020-01-07
5 3 2022-01-12
6 3 2022-01-14
7 3 2022-01-15
You can then add a seq status to that first frame by checking whether each ID shows up in this new frame.
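A small sketch of that last step, reusing the frames computed above (the column name has_gap is just a placeholder I chose):
# (subject_ID, missing date) pairs from the resample step above
gaps = dates.index[dates.isnull().any(axis=1)].to_frame().reset_index(drop=True)

# per-subject item counts from the first part
counts = df.groupby("subject_ID")['item_name'].value_counts().unstack(fill_value=0)

# flag subjects that have at least one missing day in their sequence
counts['has_gap'] = counts.index.isin(gaps['subject_ID'])
print(counts)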

Counting all string values in a given column of a table and grouping based on a third column

I have three columns. The table looks like this:
ID. names tag
1. john. 1
2. sam 0
3. sam,robin. 1
4. robin. 1
Id: type integer
Names: type string
Tag: type integer (just 0,1)
What I want is to find how many times each name is repeated, grouped by 0 and 1. This is to be done in Python.
The answer must look like:
0 1
John 23 12
Robin 32 10
sam 9 30
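For reference, a minimal reconstruction of the sample frame both answers below work with (column names and trailing periods kept exactly as shown in the table):
import pandas as pd

df = pd.DataFrame({
    'ID.':   [1, 2, 3, 4],
    'names': ['john.', 'sam', 'sam,robin.', 'robin.'],
    'tag':   [1, 0, 1, 1],
})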
Using extractall and crosstab:
s = df.names.str.extractall(r'(\w+)').reset_index(1, drop=True).join(df.tag)
pd.crosstab(s[0], s['tag'])
tag 0 1
0
john 0 1
robin 0 2
sam 1 1
Because of the nature of your names column, there is some re-processing that needs to be done before you can get value counts. In the case of your example dataframe, this could look something like:
my_counts = (df.set_index(['ID.', 'tag'])
               # Get rid of periods and split on commas
               .names.str.strip('.').str.split(',')
               .apply(pd.Series)
               .stack()
               .reset_index([0, 1])
               # rename column 0 for consistency, easier reading
               .rename(columns={0: 'names'})
               # Get value counts of names per tag:
               .groupby('tag')['names']
               .value_counts()
               .unstack('tag', fill_value=0))
>>> my_counts
tag 0 1
names
john 0 1
robin 0 2
sam 1 1

Efficient intersection of grouped pandas column

I have a tall pandas dataframe called use with columns ID, Date, .... Each row is unique, but each ID has many rows, with one row per ID per date.
ID Date Other_data
1 1-1-01 10
2 1-1-01 23
3 1-1-01 0
1 1-2-01 11
3 1-2-01 1
1 1-3-01 9
2 1-3-01 20
3 1-3-01 2
I also have a list of unique ids, ids = use['ID'].drop_duplicates()
I want to find the intersection of all of the dates, that is, only the dates for which each ID has data. The end result in this toy problem should be [1-1-01, 1-3-01]
Currently, I loop through, subsetting by ID and taking the intersection. Roughly speaking, it looks like this:
dates = use['Date'].drop_duplicates()
for i in ids:
    id_dates = use[(use['ID'] == i)]['Date'].values
    dates = set(dates).intersection(id_dates)
This strikes me as horrifically inefficient. What is a more efficient way to identify dates where each ID has data?
Thanks very much!
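For reference, a minimal reconstruction of the sample frame used by the answers below (values transcribed from the table above):
import pandas as pd

use = pd.DataFrame({
    'ID':         [1, 2, 3, 1, 3, 1, 2, 3],
    'Date':       ['1-1-01', '1-1-01', '1-1-01', '1-2-01', '1-2-01',
                   '1-3-01', '1-3-01', '1-3-01'],
    'Other_data': [10, 23, 0, 11, 1, 9, 20, 2],
})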
Using crosstab, cells with the value 0 mark the missing (ID, Date) combinations, which you can find with df.eq(0).any().
df = pd.crosstab(use.ID, use.Date)
df
Out[856]:
Date 1-1-01 1-2-01 1-3-01
ID
1 1 1 1
2 1 0 1
3 1 1 1
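To finish this approach, one option (my addition, assuming you want the dates as a list) is to keep only the crosstab columns that contain no 0:
# dates where every ID has at least one row: columns with no 0 anywhere
common_dates = df.columns[df.ne(0).all()].tolist()
print(common_dates)  # ['1-1-01', '1-3-01']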
Find the unique IDs per date, then check if that's all of them.
gp = use.groupby('Date').ID.nunique()
gp[gp == use.ID.nunique()].index.tolist()
#['1-1-01', '1-3-01']

Pandas substring search for filter

I have a use case where I need to validate each row in the df and mark if it is correct or not. Validation rules are in another df.
Main
col1 col2
0 1 take me home
1 2 country roads
2 2 country roads take
3 4 me home
Rules
col3 col4
0 1 take
1 2 home
2 3 country
3 4 om
4 2 take
A row in main is marked as pass if the following condition matches for any row in rules
The condition for passing is:
col1==col3 and col4 is substring of col2
Main
col1 col2 result
0 1 take me home Pass
1 2 country roads Fail
2 2 country roads take Pass
3 4 me home Pass
My initial approach was to parse the Rules df, create a function out of it dynamically, and then run:
def action_function(row) -> object:
    if self.combined_filter()(row):  # combined_filter() is the lambda equivalent of Rules df
        return success_action(row)   # mark as pass
    return fail_action(row)          # mark as fail

Main["result"] = self.df.apply(action_function, axis=1)
This turned out to be very slow because apply is not vectorized. The main df has about 3 million rows and the Rules df has around 500 entries. The time taken is around 3 hours.
I am trying to use pandas merge for this. But substring match is not supported by the merge operation. I cannot split words by space or anything.
This will be used as part of a system, so I cannot hardcode anything. I need to read the df from Excel every time the system starts.
Can you please suggest an approach for this?
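For reference, a minimal reconstruction of the two example frames used in the answer below:
import pandas as pd

main = pd.DataFrame({
    'col1': [1, 2, 2, 4],
    'col2': ['take me home', 'country roads', 'country roads take', 'me home'],
})

rules = pd.DataFrame({
    'col3': [1, 2, 3, 4, 2],
    'col4': ['take', 'home', 'country', 'om', 'take'],
})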
Merge and then apply the condition using np.where, i.e.
temp = main.merge(rules, left_on='col1', right_on='col3')
temp['results'] = temp.apply(lambda x: np.where(x['col4'] in x['col2'], 'Pass', 'Fail'), 1)
no_dupe_df = temp.drop_duplicates('col2', keep='last').drop(['col3', 'col4'], 1)
col1 col2 results
0 1 take me home Pass
2 2 country roads Fail
4 2 country roads take Pass
5 4 me home Pass
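The drop_duplicates(keep='last') step depends on the row order produced by the merge; a sketch of a more explicit variant, assuming the intent is Pass whenever any matching rule's col4 is a substring of col2 (and Fail when no rule matches at all), could look like this:
temp = main.reset_index().merge(rules, left_on='col1', right_on='col3', how='left')

# a rule matches when its col4 is a substring of the row's col2
temp['match'] = [
    isinstance(c4, str) and c4 in c2
    for c2, c4 in zip(temp['col2'], temp['col4'])
]

# a main row passes if any of its merged rule rows matched
main['result'] = (temp.groupby('index')['match'].any()
                      .map({True: 'Pass', False: 'Fail'}))
print(main)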
