Calculate difference between values with same key index - python

Hello, I have this DataFrame (sample):
User  Timestamp   Production
A     2020-01-01  5
A     2020-06-01  7
A     2020-12-01  15
B     2020-01-01  2
B     2020-06-01  7
B     2020-12-01  9
So, I need to calculate the difference between consecutive Production values for each user and append it to the DataFrame as a new column; for the first period of each user, the column should simply contain that user's Production value.
The resulting table would be as follows (Production column omitted due to a problem with the table editor):
User  Timestamp   Difference
A     2020-01-01  5
A     2020-06-01  2
A     2020-12-01  8
B     2020-01-01  2
B     2020-06-01  5
B     2020-12-01  2
So I tried the .diff() function, but obviously it doesn't recognize when the User changes. I then tried a groupby() on the User column and then computed the diff, but I get the same problem:
df['Difference'] = df.groupby('User')['Production'].diff()
Can someone help me out?
Thanks!
EDIT:
I made some progress, but I'm still trying to figure it out. I wrote this:
grouped = df.groupby('User')
diff = lambda x: x['Production'].shift(+1) - x['Production']
df['diff'] = grouped.apply(diff).reset_index(0, drop=True).fillna(df['Production'])
This computes the difference the way I want it, but it still gets mixed up when the User identifier changes.
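For what it's worth, a minimal sketch of one way to get the expected column, assuming the frame is already sorted by User and Timestamp: take the per-user diff and fill each user's first row with its own Production value.
import pandas as pd

df = pd.DataFrame({
    'User': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Timestamp': ['2020-01-01', '2020-06-01', '2020-12-01',
                  '2020-01-01', '2020-06-01', '2020-12-01'],
    'Production': [5, 7, 15, 2, 7, 9]})

# Difference to the previous period within each user; the first period of a
# user has no predecessor (NaN), so fall back to its own Production value.
df['Difference'] = (df.groupby('User')['Production']
                      .diff()
                      .fillna(df['Production']))
print(df)
This reproduces the Difference column shown above (5, 2, 8 for A and 2, 5, 2 for B).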

Related

Improving speed of iterrows() query that is utilizing a mask

I have a large dataset that looks similar to this in terms of content:
test = pd.DataFrame({'date': ['2018-08-01','2018-08-01','2018-08-02','2018-08-03','2019-09-01','2019-09-02','2019-09-03','2020-01-02','2020-01-03','2020-01-04','2020-10-04','2020-10-05'],
                     'account': ['a','a','a','a','b','b','b','c','c','c','d','e']})
For each account, I am attempting to create a column that specifies "Yes" to rows that have the earliest date (even if that earliest date repeats), and "No" otherwise. I am using the following code which works nicely on a smaller subset of this data, but not on my entire (larger) dataset.
first_date = test.groupby('account').agg({'date': np.min})
test['first_date'] = 'No'
for row in first_date.iterrows():
    account = row[0]
    date = row[1].date
    mask = (test.account == account) & (test.date == date)
    test.loc[mask, 'first_date'] = 'Yes'
Any ideas for improvement? I'm fairly new to Python and already running into runtime issues with larger pandas DataFrames. Thanks in advance.
Generally when we use pandas or numpy we want to avoid iterating over our data and use the provided vectorized methods.
Use groupby.transform to get a min date on each row, then use np.where to create your conditional column:
m = test['date'] == test.groupby('account')['date'].transform('min')
test['first_date'] = np.where(m, 'Yes', 'No')
date account first_date
0 2018-08-01 a Yes
1 2018-08-01 a Yes
2 2018-08-02 a No
3 2018-08-03 a No
4 2019-09-01 b Yes
5 2019-09-02 b No
6 2019-09-03 b No
7 2020-01-02 c Yes
8 2020-01-03 c No
9 2020-01-04 c No
10 2020-10-04 d Yes
11 2020-10-05 e Yes

Python pandas: Get first values of group

I have a list of recorded diagnoses like this:
df = pd.DataFrame({
    "DiagnosisTime": ["2017-01-01 08:23:00", "2017-01-01 08:23:00", "2017-01-01 08:23:03",
                      "2017-01-01 08:27:00", "2019-12-31 20:19:39", "2019-12-31 20:19:39"],
    "ID": [1, 1, 1, 1, 2, 2]
})
There are multiple subjects that can be identified by an ID. For each subject there may be one or more diagnoses. Each diagnosis may be comprised of multiple entries (as multiple things are recorded, though not in this example).
The individual diagnoses (with multiple rows) can, to some extent, be identified by the DiagnosisTime. However, sometimes there is a little delay during the writing of data for one diagnosis, so I want to allow a small tolerance of a few seconds when grouping by DiagnosisTime.
In this example I want a result as follows:
There are two diagnoses for ID 1: rows 0, 1, 2 and row 3. Note the slightly different DiagnosisTime in row 2 compared to rows 0 and 1. ID 2 has one diagnosis, comprising rows 4 and 5.
For each ID I want to set the counter back to 1 (or 0 if that's easier).
This is how far I've come:
df["DiagnosisTime"] = pd.to_datetime(df["DiagnosisTime"])
df["diagnosis_number"] = df.groupby([pd.Grouper(freq='5S', key="DiagnosisTime"), 'ID']).ngroup()
I think I successfully identified diagnoses within one ID (not entirely sure about the Grouper), but I don't know how to reset the counter.
If that is not possible I would also be satisfied with a function which returns all records of one ID that have the lowest diagnosis_number within that group.
You can use a lambda function with GroupBy.transform and factorize:
df["diagnosis_number"] = (df.groupby('ID')['diagnosis_number']
.transform(lambda x: pd.factorize(x)[0]) + 1)
print (df)
DiagnosisTime ID diagnosis_number
0 2017-01-01 08:23:00 1 1
1 2017-01-01 08:23:00 1 1
2 2017-01-01 08:23:03 1 1
3 2017-01-01 08:27:00 1 2
4 2019-12-31 20:19:39 2 1
5 2019-12-31 20:19:39 2 1
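For reference, a minimal end-to-end sketch combining the 5-second Grouper binning from the question with the factorize step above (note that fixed 5-second bins only approximate the intended tolerance, as the question itself suspects):
import pandas as pd

df = pd.DataFrame({
    "DiagnosisTime": ["2017-01-01 08:23:00", "2017-01-01 08:23:00",
                      "2017-01-01 08:23:03", "2017-01-01 08:27:00",
                      "2019-12-31 20:19:39", "2019-12-31 20:19:39"],
    "ID": [1, 1, 1, 1, 2, 2]})
df["DiagnosisTime"] = pd.to_datetime(df["DiagnosisTime"])

# Global label per (5-second bin, ID) pair, as in the question ...
df["diagnosis_number"] = df.groupby(
    [pd.Grouper(freq='5S', key="DiagnosisTime"), 'ID']).ngroup()
# ... then restart the counter at 1 within each ID.
df["diagnosis_number"] = (df.groupby('ID')['diagnosis_number']
                            .transform(lambda x: pd.factorize(x)[0]) + 1)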

Pandas: using groupby and nunique taking time into account

I have a dataframe in this form:
A B time
1 2 2019-01-03
1 3 2018-04-05
1 4 2020-01-01
1 4 2020-02-02
where A and B contain some integer identifiers.
I want to measure the number of different identifiers each A has interacted with. To do this I usually simply do
df.groupby('A')['B'].nunique()
I now have to do a slightly different thing: each identifier has a date assigned (different for each identifier) that splits its interactions into 2 parts: the ones happening before that date, and the ones happening after that date. The same operation as before (counting the number of unique B values interacted with) needs to be done for both parts separately.
For example, if the date for A=1 was 2018-07-01, the output would be
A before after
1 1 2
In the real data, A contains millions of different identifiers, each with its unique date assigned.
EDITED
To be clearer, I added a line to df. I want to count the number of different values of B each A interacts with, before and after the date.
I would convert A into its assigned date (here via a dict date_dict mapping each A to its date), compare those with df['time'], and then use groupby().value_counts():
(df['A'].map(date_dict)
    .gt(df['time'])
    .groupby(df['A'])
    .value_counts()
    .unstack()
    .rename({False: 'after', True: 'before'}, axis=1)
)
Output:
   after  before
A
1      2       1
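Since the question asks for the number of distinct B values rather than the number of rows, a variant using nunique may be closer to the stated goal. A sketch, still assuming date_dict maps each A to its cutoff date and that df['time'] is already datetime:
import numpy as np

out = (df.assign(period=np.where(df['time'] < df['A'].map(date_dict),
                                 'before', 'after'))
         .groupby(['A', 'period'])['B']
         .nunique()
         .unstack(fill_value=0))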

How to find missing date rows in a sequence using pandas?

I have a dataframe with more than 4 million rows and 30 columns. I am just providing a sample of my patient dataframe
df = pd.DataFrame({
    'subject_ID': [1,1,1,1,1,2,2,2,2,2,3,3,3],
    'date_visit': ['1/1/2020 12:35:21','1/1/2020 14:35:32','1/1/2020 16:21:20','01/02/2020 15:12:37','01/03/2020 16:32:12',
                   '1/1/2020 12:35:21','1/3/2020 14:35:32','1/8/2020 16:21:20','01/09/2020 15:12:37','01/10/2020 16:32:12',
                   '11/01/2022 13:02:31','13/01/2023 17:12:31','16/01/2023 19:22:31'],
    'item_name': ['PEEP','Fio2','PEEP','Fio2','PEEP','PEEP','PEEP','PEEP','PEEP','PEEP','Fio2','Fio2','Fio2']})
I would like to do two things
1) Find the subjects and their records which are missing in the sequence
2) Get the count of item_name for each subject
For q2, this is what I tried:
df.groupby(['subject_ID','item_name']).count() # though this produces output, the column name is not okay. I mean, why does it show the count value in the `date_visit` column?
For q1, this is what I am trying
df['day'].le(df['shift_date'].add(1))
I expect my output to be as shown below.
You can get the first part with:
In [14]: df.groupby("subject_ID")['item_name'].value_counts().unstack(fill_value=0)
Out[14]:
item_name Fio2 PEEP
subject_ID
1 2 3
2 0 5
3 3 0
EDIT:
I think you've still got your date formats a bit messed up in your sample output, and strongly recommend switching everything to the ISO 8601 standard since that prevents problems like that down the road. pandas won't correctly parse that 11/01/2022 entry on its own, so I've manually fixed it in the sample.
Using what I assume these dates are supposed to be, you can find the gaps by grouping and using .resample():
In [73]: df['dates'] = pd.to_datetime(df['date_visit'])
In [74]: df.loc[10, 'dates'] = pd.to_datetime("2022-01-11 13:02:31")
In [75]: dates = df.groupby("subject_ID").apply(lambda x: x.set_index('dates').resample('D').first())
In [76]: dates.index[dates.isnull().any(axis=1)].to_frame().reset_index(drop=True)
Out[76]:
subject_ID dates
0 2 2020-01-02
1 2 2020-01-04
2 2 2020-01-05
3 2 2020-01-06
4 2 2020-01-07
5 3 2022-01-12
6 3 2022-01-14
7 3 2022-01-15
You can then add a sequence status to that first frame by checking whether the ID shows up in this new frame.
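For example, a sketch of that last step (the seq_status column name and its labels are hypothetical, not from the original post):
import numpy as np

missing = dates.index[dates.isnull().any(axis=1)].to_frame().reset_index(drop=True)
# Hypothetical labels: flag subjects that have at least one missing day.
df['seq_status'] = np.where(df['subject_ID'].isin(missing['subject_ID']),
                            'has gaps', 'complete')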

Python Pandas sum a constant value in Columns If date between 2 dates

Let's assume a dataframe using datetimes as index, where we have a column named 'score', initially set to 10:
score
2016-01-01 10
2016-01-02 10
2016-01-03 10
2016-01-04 10
2016-01-05 10
2016-01-06 10
2016-01-07 10
2016-01-08 10
I want to subtract a fixed value (let's say 1) from the score, but only when the index is between certain dates (for example between the 3rd and the 6th):
score
2016-01-01 10
2016-01-02 10
2016-01-03 9
2016-01-04 9
2016-01-05 9
2016-01-06 9
2016-01-07 10
2016-01-08 10
Since my real dataframe is big, and I will be doing this for different date ranges with a different fixed value N for each one of them, I'd like to achieve this without having to create a new column set to -N for each case.
Something like numpy's where function, but for a certain range, allowing me to add to or subtract from the current value if the condition is met and do nothing otherwise. Is there something like that?
Use index slicing:
df.loc['2016-01-03':'2016-01-06', 'score'] -= 1
Assuming dates are datetime dtype:
# if date is its own column:
df.loc[df['date'].dt.day.between(3, 6), 'score'] = df['score'] - 1
# if date is the index:
df.loc[df.index.day.isin(range(3, 7)), 'score'] = df['score'] - 1
I would do something like that using query:
import pandas as pd
import numpy as np

df = pd.DataFrame({"score": np.random.randint(1, 10, 100)},
                  index=pd.date_range(start="2018-01-01", periods=100))
start = "2018-01-05"
stop = "2018-04-08"
df.query('@start <= index <= @stop') - 1
Edit: an eval expression that evaluates to a boolean mask can also be used, but in a different manner, because pandas where acts on the False values.
df.where(~df.eval('@start <= index <= @stop'),
         df['score'] - 1, axis=0, inplace=True)
See how I inverted the condition (with ~) in order to get what I wanted. It's efficient but not really clear. Of course, you can also use np.where and all is good in the world.
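A sketch of that np.where variant, reusing the same start/stop bounds defined above:
import numpy as np

in_range = (df.index >= start) & (df.index <= stop)
df['score'] = np.where(in_range, df['score'] - 1, df['score'])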
