I have a DataFrame date column with the values below:
2015-01-01
2015-02-01
2015-03-01
2015-07-01
2015-08-01
2015-10-01
2015-11-01
2016-02-01
I want to find the difference between consecutive values in months, as below:
date_dt diff_mnts
2015-01-01 0
2015-02-01 1
2015-03-01 1
2015-07-01 4
2015-08-01 1
2015-10-01 2
2015-11-01 1
2016-02-01 3
I tried using the diff() method to calculate the difference in days and then converting with astype('timedelta64[M]'), but when the gap is less than 30 days the month difference comes out as 0. Please let me know if there is an easy built-in function I can try in this case.
Option 1
Change the period and call diff.
df
Date
0 2015-01-01
1 2015-02-01
2 2015-03-01
3 2015-07-01
4 2015-08-01
5 2015-10-01
6 2015-11-01
7 2016-02-01
df.Date.dtype
dtype('<M8[ns]')
df.Date.dt.to_period('M').diff().fillna(0)
0 0
1 1
2 1
3 4
4 1
5 2
6 1
7 3
Name: Date, dtype: int64
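Depending on the pandas version, calling diff on a period column may return offset objects (such as <MonthEnd>) rather than plain integers. The following is a minimal sketch, assuming the same Date column as above, that normalizes either kind of result to integers. The helper name to_months is hypothetical and not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({"Date": pd.to_datetime([
    "2015-01-01", "2015-02-01", "2015-03-01", "2015-07-01",
    "2015-08-01", "2015-10-01", "2015-11-01", "2016-02-01",
])})

# Difference between consecutive monthly periods; the element type
# (number or offset object) depends on the pandas version.
raw = df["Date"].dt.to_period("M").diff()


def to_months(x):
    """Hypothetical helper: whole months for a number, an offset, or a missing value."""
    if pd.isna(x):
        return 0  # the first row has no predecessor
    # Offset objects expose their multiple as .n; plain numbers are used as-is.
    return int(getattr(x, "n", x))


diff_mnts = raw.map(to_months)
print(diff_mnts.tolist())
```

Mapping through a small helper keeps the result independent of which type diff happens to return.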
Option 2
Alternatively, call diff on dt.month, but you'll need to account for gaps over a year (solution improved thanks to @galaxyan!):
i = df.Date.dt.year.diff() * 12
j = df.Date.dt.month.diff()
(i + j).fillna(0).astype(int)
0 0
1 1
2 1
3 4
4 1
5 2
6 1
7 3
Name: Date, dtype: int64
Caveat: calling diff on dt.month alone would not work for gaps over a year, which is why the year difference is added above.
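Try the following steps:
Cast the column to datetime format.
Use the .dt.month accessor to get the month number.
Use the shift() method in pandas to calculate the difference.
Example code will look something like this:
df['diff_mnts'] = df['date_dt'].dt.month - df['date_dt'].shift().dt.month
Note that this compares month numbers only, so it gives a wrong result when consecutive dates fall in different years.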
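A shift-based difference can be made to work across year boundaries by counting months on one running scale (year * 12 + month) before subtracting. This is a minimal sketch, assuming a datetime column named date_dt. The three sample dates are taken from the question, and month_index is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({"date_dt": pd.to_datetime([
    "2015-10-01", "2015-11-01", "2016-02-01",
])})

# Put every date on a single running month scale so that a change of
# year does not produce a negative month difference.
month_index = df["date_dt"].dt.year * 12 + df["date_dt"].dt.month

# Subtract the previous row; the first row has no predecessor, so use 0.
df["diff_mnts"] = (month_index - month_index.shift()).fillna(0).astype(int)
print(df)
```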
Related
I am fairly new to Python and I am trying to calculate if a patient was readmitted to the hospital within 30 days or not.
The data is in the form of Pandas dataframe with columns for Patient Id, Arrival Date, Departure Date and Status (Discharged, Admitted, Did Not Wait). The question is similar to this past question with same requirements but I need the code in Python.
Calculate readmission rate
I only need one column of readmission (30 day readmission status). Any help in the code's translation is appreciated. Thanks in advance.
@anky_91 Please do correct me if I am wrong in my understanding. Some random examples: my data, ex1, ex2, ex3.
You can use the below:
df.groupby('Patient').apply(lambda x : (x['Admission Date'].\
shift(-1)-x['Discharge date']).dt.days.le(30).astype(int)).reset_index(drop=True)
Full code:
Considering the df looks like:
Visit Patient Admission Date Discharge date
0 1 1 2015-01-01 2015-01-02
1 2 2 2015-01-01 2015-01-01
2 3 3 2015-01-01 2015-01-02
3 4 1 2015-01-09 2015-01-09
4 5 2 2015-04-01 2015-04-05
5 6 1 2015-05-01 2015-05-01
df[['Admission Date','Discharge date']] = df[['Admission Date','Discharge date']].\
apply(lambda x: pd.to_datetime(x))
df = df.sort_values(['Patient','Admission Date']) # Thanks @Jondiedoop
df['Readmit30']=df.groupby('Patient').apply(lambda x : (x['Admission Date'].\
shift(-1)-x['Discharge date']).dt.days.le(30).astype(int)).reset_index(level=0, drop=True)
print(df)
Visit Patient Admission Date Discharge date Readmit30
0 1 1 2015-01-01 2015-01-02 1
3 4 1 2015-01-09 2015-01-09 0
5 6 1 2015-05-01 2015-05-01 0
1 2 2 2015-01-01 2015-01-01 0
4 5 2 2015-04-01 2015-04-05 0
2 3 3 2015-01-01 2015-01-02 0
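The same flag can also be computed without apply by shifting the admission date within each patient group. The sketch below assumes the sample frame and column names shown above. The intermediate names next_admission and days_to_next are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "Visit": [1, 2, 3, 4, 5, 6],
    "Patient": [1, 2, 3, 1, 2, 1],
    "Admission Date": ["2015-01-01", "2015-01-01", "2015-01-01",
                       "2015-01-09", "2015-04-01", "2015-05-01"],
    "Discharge date": ["2015-01-02", "2015-01-01", "2015-01-02",
                       "2015-01-09", "2015-04-05", "2015-05-01"],
})
for col in ["Admission Date", "Discharge date"]:
    df[col] = pd.to_datetime(df[col])

# Visits must be in chronological order within each patient.
df = df.sort_values(["Patient", "Admission Date"])

# Admission date of the same patient's next visit (NaT for the last visit).
next_admission = df.groupby("Patient")["Admission Date"].shift(-1)

# Days from this discharge to the next admission; NaN when there is none.
days_to_next = (next_admission - df["Discharge date"]).dt.days

# A visit is flagged when it is followed by another admission within 30 days.
# Comparisons against NaN are False, so last visits get 0.
df["Readmit30"] = days_to_next.le(30).astype(int)
print(df)
```

As in the answer above, the flag marks the visit that is followed by a readmission, not the readmission itself.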
You can also try this one (I don't know why the one above was giving false readmission flags for me).
After sorting on visit_start_date:
visits_pandas_df.groupby('PatientId').apply(lambda x: (((x['visit_start_date'].shift(-1)-x['visit_end_date']).dt.days.shift(1).le(30)) ).astype(int)).values
Visits that are only one day apart are not counted as readmissions, so you will also need to check for that in your logic.
I want to apply an operation to the following data frame:
index date username count
0 2015-11-01 1 16
1 2015-11-01 2 1
2 2015-11-01 3 1
3 2015-10-01 1 2
4 2015-10-01 4 29
5 2015-10-01 5 1
6 2014-09-01 1 3
7 2014-09-01 3 1
8 2014-09-01 4 1
And apply an operation that will get it to this:
index date mean
0 2015-11-01 6
1 2015-10-01 10.7
2 2014-09-01 1.7
The calculation takes the sum of all counts for a given date (e.g. for 2015-11-01 it is 16+1+1=18) and divides it by the number of unique usernames for that date (e.g. for 2015-10-01 there are 3). The result is recorded in a new column, which we have called mean.
I have been trying to use the 'apply' method from DataFrame but without success yet. Help would be very much appreciated. Thanks
You can use GroupBy + sum divided by GroupBy + nunique:
g = df.groupby('date')
res = g['count'].sum().div(g['username'].nunique())\
.rename('mean').reset_index()
print(res)
date mean
0 2014-09-01 1.666667
1 2015-10-01 10.666667
2 2015-11-01 6.000000
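For reference, here is a self-contained sketch of the same calculation using named aggregation, with the sample data from the question typed in by hand. The intermediate column names total and users are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2015-11-01"] * 3 + ["2015-10-01"] * 3 + ["2014-09-01"] * 3,
    "username": [1, 2, 3, 1, 4, 5, 1, 3, 4],
    "count": [16, 1, 1, 2, 29, 1, 3, 1, 1],
})

# One pass over the groups: total count and number of distinct users per date.
summary = df.groupby("date").agg(
    total=("count", "sum"),
    users=("username", "nunique"),
)

# Mean count per unique user for each date.
summary["mean"] = summary["total"] / summary["users"]
res = summary["mean"].reset_index()
print(res)
```

Keeping total and users as separate columns makes it easy to check the numerator and denominator for each date.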
I am looking to group by two columns: user_id and date. However, if the dates are close enough, I want to be able to consider the two entries part of the same group and group accordingly. Dates are in m-d-y format.
user_id date val
1 1-1-17 1
2 1-1-17 1
3 1-1-17 1
1 1-1-17 1
1 1-2-17 1
2 1-2-17 1
2 1-10-17 1
3 2-1-17 1
The grouping would be by user_id and by dates within +/- 3 days of each other, so the group-by summing val would look like:
user_id date sum(val)
1 1-2-17 3
2 1-2-17 2
2 1-10-17 1
3 1-1-17 1
3 2-1-17 1
Is there any way this could be done (somewhat) easily? I know there are some problematic aspects to this, for example what to do if the dates chain together endlessly, each three days apart. But the exact data I'm using only has 2 values per person.
Thanks!
I'd convert this to a datetime column and then use pd.TimeGrouper:
dates = pd.to_datetime(df.date, format='%m-%d-%y')
print(dates)
0 2017-01-01
1 2017-01-01
2 2017-01-01
3 2017-01-01
4 2017-01-02
5 2017-01-02
6 2017-01-10
7 2017-02-01
Name: date, dtype: datetime64[ns]
df = (df.assign(date=dates).set_index('date')
.groupby(['user_id', pd.TimeGrouper('3D')])
.sum()
.reset_index())
print(df)
user_id date val
0 1 2017-01-01 3
1 2 2017-01-01 2
2 2 2017-01-10 1
3 3 2017-01-01 1
4 3 2017-01-31 1
Similar solution using pd.Grouper:
df = (df.assign(date=dates)
.groupby(['user_id', pd.Grouper(key='date', freq='3D')])
.sum()
.reset_index())
print(df)
user_id date val
0 1 2017-01-01 3
1 2 2017-01-01 2
2 2 2017-01-10 1
3 3 2017-01-01 1
4 3 2017-01-31 1
Update: TimeGrouper was deprecated and has since been removed from pandas, so Grouper is preferred in this scenario (thanks for the heads up, Vaishali!).
I came up with a very ugly solution, but it still works...
df=df.sort_values(['user_id','date'])
df['Key']=df.sort_values(['user_id','date']).groupby('user_id')['date'].diff().dt.days.lt(3).ne(True).cumsum()
df.groupby(['user_id','Key'],as_index=False).agg({'val':'sum','date':'first'})
Out[586]:
user_id Key val date
0 1 1 3 2017-01-01
1 2 2 2 2017-01-01
2 2 3 1 2017-01-10
3 3 4 1 2017-01-01
4 3 5 1 2017-02-01
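The gap-based idea can be sketched in a self-contained way as follows: a new group starts whenever a user's next date is more than 3 days after the previous one. This assumes the sample data from the question. The names gap, new_group and key are illustrative, and the 3-day threshold is taken from the question's "+/- 3 days".

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 2, 3, 1, 1, 2, 2, 3],
    "date": ["1-1-17", "1-1-17", "1-1-17", "1-1-17",
             "1-2-17", "1-2-17", "1-10-17", "2-1-17"],
    "val": [1] * 8,
})
df["date"] = pd.to_datetime(df["date"], format="%m-%d-%y")
df = df.sort_values(["user_id", "date"])

# Time since the same user's previous row (NaT for each user's first row).
gap = df.groupby("user_id")["date"].diff()

# Start a new group on a user's first row or after a gap of more than 3 days.
new_group = gap.isna() | (gap > pd.Timedelta(days=3))
df["key"] = new_group.cumsum()

out = (df.groupby(["user_id", "key"], as_index=False)
         .agg(date=("date", "last"), val=("val", "sum")))
print(out)
```

Unlike fixed 3-day bins, this groups by proximity, so two dates one day apart always land in the same group regardless of where a bin edge would fall.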
I have the following list in pandas:
s = ['jan_1', 'jan_15', 'feb_1', 'feb_15', 'mar_1', 'mar_15', 'apr_1', 'apr_15', 'may_1', 'may_15', 'jun_1', 'jun_15', 'jul_1', 'jul_15', 'aug_1', 'aug_15', 'sep_1', 'sep_15', 'oct_1', 'oct_15', 'nov_1', 'nov_15', 'dec_1', 'dec_15']
Is there a way to convert it into datetime?
I tried:
pd.to_datetime(pd.Series(s))
You have to specify the format argument while calling pd.to_datetime. Try
pd.to_datetime(pd.Series(s), format='%b_%d')
this gives
0 1900-01-01
1 1900-01-15
2 1900-02-01
3 1900-02-15
4 1900-03-01
5 1900-03-15
6 1900-04-01
7 1900-04-15
8 1900-05-01
9 1900-05-15
For setting the current year, a hack may be required, like
pd.to_datetime(pd.Series(s) + '_2015', format='%b_%d_%Y')
to get
0 2015-01-01
1 2015-01-15
2 2015-02-01
3 2015-02-15
4 2015-03-01
5 2015-03-15
6 2015-04-01
7 2015-04-15
8 2015-05-01
9 2015-05-15
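The year-appending trick can be wrapped in a small function so the year is a parameter. This is only a sketch: parse_month_day is a hypothetical helper name, and the labels are assumed to follow the 'mon_day' pattern from the question.

```python
import pandas as pd


def parse_month_day(labels, year):
    """Hypothetical helper: parse labels like 'jan_1' as dates in the given year."""
    # Append the year to each label so the format string can include %Y.
    with_year = pd.Series(labels) + f"_{year}"
    return pd.to_datetime(with_year, format="%b_%d_%Y")


s = ["jan_1", "jan_15", "feb_1", "feb_15", "dec_15"]
parsed = parse_month_day(s, 2015)
print(parsed)
```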
I have a simple dataframe that looks like this:
I would like to use groupby to group by id, then find some way to difference the dates, and then column bind them back to the dataframe, so I end up with this:
The groupby is straightforward,
grouped = DF.groupby('id')
and finding the earliest date is straightforward,
mindates = grouped['date'].min()
But I'm not sure how to proceed. How do I apply the date subtraction operation, then combine?
There is a similar question here.
Thanks for reading this far.
My dataframe is:
import pandas as pd
dates = pd.to_datetime(['2015-01-01', '2015-02-01', '2015-03-01', '2015-04-01', '2015-05-01', '2015-01-01', '2015-01-02', '2015-01-03', '2015-01-04', '2015-01-05'])
DF = pd.DataFrame({'id':[1,1,1,1,1,2,2,2,2,2], 'date':dates})
cols = ['id', 'date']
DF=DF[cols]
EDIT:
Both answers below are awesome. I wish I could accept them both.
You can use apply like this:
earliest_by_id = DF.groupby('id')['date'].min()
def since_earliest(row):
    return row.date - earliest_by_id[row.id]
DF['days_since_earliest'] = DF.apply(since_earliest, axis=1)
print(DF)
id date days_since_earliest
0 1 2015-01-01 0 days
1 1 2015-02-01 31 days
2 1 2015-03-01 59 days
3 1 2015-04-01 90 days
4 1 2015-05-01 120 days
5 2 2015-01-01 0 days
6 2 2015-01-02 1 days
7 2 2015-01-03 2 days
8 2 2015-01-04 3 days
9 2 2015-01-05 4 days
edit:
DF['days_since_earliest'] = DF.apply(since_earliest, axis=1).dt.days
print(DF)
id date days_since_earliest
0 1 2015-01-01 0
1 1 2015-02-01 31
2 1 2015-03-01 59
3 1 2015-04-01 90
4 1 2015-05-01 120
5 2 2015-01-01 0
6 2 2015-01-02 1
7 2 2015-01-03 2
8 2 2015-01-04 3
9 2 2015-01-05 4
FWIW, using transform can often be simpler (and usually faster) than apply. transform takes the results of a groupby operation and broadcasts it up to the original index:
>>> df["dse"] = df["date"] - df.groupby("id")["date"].transform(min)
>>> df
id date dse
0 1 2015-01-01 0 days
1 1 2015-02-01 31 days
2 1 2015-03-01 59 days
3 1 2015-04-01 90 days
4 1 2015-05-01 120 days
5 2 2015-01-01 0 days
6 2 2015-01-02 1 days
7 2 2015-01-03 2 days
8 2 2015-01-04 3 days
9 2 2015-01-05 4 days
If you'd prefer integer days instead of timedelta objects, you can use the dt.days accessor:
>>> df["dse"] = df["dse"].dt.days
>>> df
id date dse
0 1 2015-01-01 0
1 1 2015-02-01 31
2 1 2015-03-01 59
3 1 2015-04-01 90
4 1 2015-05-01 120
5 2 2015-01-01 0
6 2 2015-01-02 1
7 2 2015-01-03 2
8 2 2015-01-04 3
9 2 2015-01-05 4
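Putting the transform approach together as a runnable sketch, using the sample frame from the question. The aggregation is passed by name ("min") rather than as the built-in function, and first_seen is an illustrative variable name.

```python
import pandas as pd

dates = pd.to_datetime([
    "2015-01-01", "2015-02-01", "2015-03-01", "2015-04-01", "2015-05-01",
    "2015-01-01", "2015-01-02", "2015-01-03", "2015-01-04", "2015-01-05",
])
df = pd.DataFrame({"id": [1] * 5 + [2] * 5, "date": dates})

# Earliest date per id, broadcast back to one value per original row.
first_seen = df.groupby("id")["date"].transform("min")

# Subtract and keep whole days as integers.
df["dse"] = (df["date"] - first_seen).dt.days
print(df)
```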