How to create a row_number based on some conditions in pandas - python

I have a data frame like this:
Clinic Number date
0 1 2015-05-05
1 1 2015-05-05
2 1 2016-01-01
3 2 2015-05-05
4 2 2016-05-05
5 3 2017-05-05
6 3 2017-05-05
I want to create a new column and fill it based on some conditions, so the new data frame should look like this:
Clinic Number date row_number
0 1 2015-05-05 1
1 1 2015-05-05 1
2 1 2016-01-01 2
3 2 2015-05-05 3
4 2 2016-05-05 4
5 3 2017-05-05 5
6 3 2017-05-05 5
The rule for filling the new column is:
Rows with the same Clinic Number and date get the same number; whenever either value changes, the number increases by one.
For example, Clinic Number 1 with date 2015-05-05 appears in two rows, so both get 1. The next row also has Clinic Number=1, but its date differs from the previous rows, so it gets 2.
For Clinic Number=2, no two rows share the same date, so the first row gets 3, the next row gets 4, and so on.
So far I have tried something like this:
def createnumber(x):
    x['row_number'] = i
d['row_number']= pd1.groupby(['Clinic Number','date']).apply(createnumber)
but I do not know how to implement this function.
I would appreciate it if you could help me with this :)
I have also looked at links like this, but they are not dynamic (I mean that here the row number should increase based on some conditions).

Instead of a groupby, you could do something like this, naming your conditions separately. If the date OR the clinic number differs from the previous row, the comparison returns True, and the cumulative sum (cumsum) of those True values gives the row number:
df['row_number'] = (df.date.ne(df.date.shift()) | df['Clinic Number'].ne(df['Clinic Number'].shift())).cumsum()
>>> df
Clinic Number date row_number
0 1 2015-05-05 1
1 1 2015-05-05 1
2 1 2016-01-01 2
3 2 2015-05-05 3
4 2 2016-05-05 4
5 3 2017-05-05 5
You'll need to make sure your dataframe is sorted by Clinic Number and date first (you could do df.sort_values(['Clinic Number', 'date'], inplace=True) if it's not sorted already).
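As a self-contained sketch of the approach above, the snippet below rebuilds the question's data (the frame construction and the variable name `changed` are assumptions for illustration) and applies the shift/cumsum idea. It also shows `groupby(...).ngroup()` as a possible alternative; with the default `sort=True` it numbers groups by sorted key, so it should not depend on the row order.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Clinic Number": [1, 1, 1, 2, 2, 3, 3],
    "date": pd.to_datetime([
        "2015-05-05", "2015-05-05", "2016-01-01", "2015-05-05",
        "2016-05-05", "2017-05-05", "2017-05-05",
    ]),
})

# The shift/cumsum approach relies on sorted rows.
df = df.sort_values(["Clinic Number", "date"])

# True wherever the date or the clinic differs from the previous row.
changed = (df["date"].ne(df["date"].shift())
           | df["Clinic Number"].ne(df["Clinic Number"].shift()))
df["row_number"] = changed.cumsum()

# Alternative: number each (clinic, date) group; ngroup() starts at 0.
df["row_number_ngroup"] = df.groupby(["Clinic Number", "date"]).ngroup() + 1

print(df)
```

Both columns should agree on sorted data; on unsorted data only the `ngroup()` variant keeps equal pairs on the same number.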

Related

Counting previous occurrences of an ID in a Dataframe within a certain date range

I have a pandas dataframe containing dates of when a customer enters a shop. I'm looking for a method that will allow me to count the number of times a customer has visited a shop in the past month from the current Date_Visited including the current visit.
So, for a minimal dataset below
Customer_ID Date_Visited (Year-Month-Day)
1 2020-07-10
2 2020-07-09
1 2020-01-01
2 2020-07-08
1 2020-07-08
3 2020-07-01
I'm looking for an output of
Customer_ID Date_Visited visit_times
1 2020-07-10 2
2 2020-07-09 2
1 2020-01-01 1
2 2020-07-08 1
1 2020-07-08 1
3 2020-07-01 1
I've been able to use a solution involving loops - but this would be inefficient for large dataframes.
I've thought about merging two copies of the dataframe and using an approach similar to the one in Counting Number of Occurrences Between Dates (Given an ID value) From Another Dataframe, but I'm not sure whether that is the best way to approach this problem.
You can group by Customer_ID and calendar month (using pandas.Grouper on the date column), after sorting the dataframe by date with pandas.DataFrame.sort_values, and apply a cumcount per group (add 1 because cumcount starts from 0). Note that this counts visits within the same calendar month, not within a rolling one-month window before each visit:
df['visit_times'] = (df.sort_values(by='Date_Visited (Year-Month-Day)')
                       .groupby(['Customer_ID',
                                 pd.Grouper(freq='M', key='Date_Visited (Year-Month-Day)')
                                ])
                       .cumcount()+1
                    )
output:
Customer_ID Date_Visited (Year-Month-Day) visit_times
0 1 2020-07-10 2
1 2 2020-07-09 2
2 1 2020-01-01 1
3 2 2020-07-08 1
4 1 2020-07-08 1
5 3 2020-07-01 1
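If a rolling window is what is needed rather than calendar months, one possible sketch is shown below. It assumes "the past month" means the 30 days up to and including the current visit; the helper `visits_in_window`, the `days` parameter and the shortened column name `Date_Visited` are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question (column name shortened for brevity).
df = pd.DataFrame({
    "Customer_ID": [1, 2, 1, 2, 1, 3],
    "Date_Visited": pd.to_datetime([
        "2020-07-10", "2020-07-09", "2020-01-01",
        "2020-07-08", "2020-07-08", "2020-07-01",
    ]),
})

def visits_in_window(dates, days=30):
    """Count one customer's visits in the window (date - days, date]."""
    values = dates.to_numpy()
    ordered = np.sort(values)
    start = values - np.timedelta64(days, "D")
    # Visits at or before the window start fall outside the window.
    first = np.searchsorted(ordered, start, side="right")
    # Visits up to and including the current date fall inside it.
    last = np.searchsorted(ordered, values, side="right")
    return pd.Series(last - first, index=dates.index)

# transform keeps the original row order and index.
df["visit_times"] = (df.groupby("Customer_ID")["Date_Visited"]
                       .transform(visits_in_window))
print(df)
```

The binary search avoids an explicit Python loop over rows; only the per-customer grouping is iterated.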

Count number of columns above a date

I have a pandas dataframe with several columns, and for each row I would like to know how many of the date columns are later than 2016-12-31. Here is an example:
ID  Bill  Date 1      Date 2      Date 3      Date 4      Bill 2
4   6     2000-10-04  2000-11-05  1999-12-05  2001-05-04  8
6   8     2016-05-03  2017-08-09  2018-07-14  2015-09-12  17
12  14    2016-11-16  2017-05-04  2017-07-04  2018-07-04  35
And I would like to get this column
Count
0
2
3
Just create the mask and call sum on axis=1:
date = pd.to_datetime('2016-12-31')
(df[['Date 1','Date 2','Date 3','Date 4']]>date).sum(1)
OUTPUT:
0 0
1 2
2 3
dtype: int64
If needed, call .to_frame('Count') to create a dataframe with Count as the column name:
(df[['Date 1','Date 2','Date 3','Date 4']]>date).sum(1).to_frame('Count')
Count
0 0
1 2
2 3
Use df.filter to select the Date* columns, then .sum(axis=1):
(df.filter(like='Date') > '2016-12-31').sum(axis=1).to_frame(name='Count')
Result:
Count
0 0
1 2
2 3
You can do:
df['Count'] = (df.loc[:, [x for x in df.columns if 'Date' in x]] > '2016-12-31').sum(axis=1)
Output:
ID Bill Date 1 Date 2 Date 3 Date 4 Bill 2 Count
0 4 6 2000-10-04 2000-11-05 1999-12-05 2001-05-04 8 0
1 6 8 2016-05-03 2017-08-09 2018-07-14 2015-09-12 17 2
2 12 14 2016-11-16 2017-05-04 2017-07-04 2018-07-04 35 3
We select the columns with 'Date' in their name, which is convenient when there are many such columns and we don't want to list them one by one. Then we compare them with the lookup date and sum the True values in each row.
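For reference, here is a minimal runnable sketch that combines the answers above on the question's data. The frame construction and the names `date_cols` and `cutoff` are assumptions for illustration; the date columns are converted to datetimes first so the comparison is chronological.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID": [4, 6, 12],
    "Bill": [6, 8, 14],
    "Date 1": ["2000-10-04", "2016-05-03", "2016-11-16"],
    "Date 2": ["2000-11-05", "2017-08-09", "2017-05-04"],
    "Date 3": ["1999-12-05", "2018-07-14", "2017-07-04"],
    "Date 4": ["2001-05-04", "2015-09-12", "2018-07-04"],
    "Bill 2": [8, 17, 35],
})

# Pick every column whose name contains "Date" and parse it as datetime.
date_cols = df.filter(like="Date").columns
df[date_cols] = df[date_cols].apply(pd.to_datetime)

# Boolean mask of dates after the cutoff, summed across each row.
cutoff = pd.Timestamp("2016-12-31")
df["Count"] = (df[date_cols] > cutoff).sum(axis=1)

print(df[["ID", "Count"]])
```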

python and dataframe: group by week and calculate the sum and difference

I have a dataframe with the following columns:
DATE ALFA BETA
2016-04-26 1 3
2016-04-27 3 0
2016-04-28 0 8
2016-04-29 4 2
2016-04-30 3 1
2016-05-01 -2 -5
2016-05-02 3 0
2016-05-03 3 3
2016-05-08 1 7
2016-05-11 3 1
2016-05-12 10 1
2016-05-13 4 2
I would like to group the data by week but treat the ALFA and BETA columns differently: for each week I want the sum of the ALFA column, and for the BETA column the difference between its value at the beginning of the week and its value at the end of the week. Here is an example of the expected result:
DATE sum_ALFA diff_BETA
2016-04-26 12 3
2016-05-03 4 4
2016-05-11 17 1
I have tried this code, but it calculates the sum for every column:
df = df.resample('W', on='DATE').sum().reset_index().sort_values(by='DATE')
This is my dataset: https://drive.google.com/uc?export=download&id=1fEqjINx9R5io7t_YxA9qShvNDxWRCUke
My week boundaries seem to differ from yours (possibly a locale difference), but you can do:
df.resample("W", on="DATE", closed="left", label="left"
           ).agg({"ALFA": "sum", "BETA": lambda g: g.iloc[0] - g.iloc[-1]})
ALFA BETA
DATE
2016-04-24 11 2
2016-05-01 4 -8
2016-05-08 18 5
I think my approach can still work for your data. If a week contains no rows, s.iloc[0] raises an IndexError, so define
def get_series_first_minus_last(s):
    try:
        return s.iloc[0] - s.iloc[-1]
    except IndexError:
        return 0
and pass this function instead of the lambda, i.e.
df.resample("W", on="DATE", closed="left", label="left"
           ).agg({"ALFA": "sum", "BETA": get_series_first_minus_last})
Note that the newly defined function could also return NaN instead of 0 if you'd prefer that.
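To make the answer above easy to try, here is a self-contained sketch on the sample rows from the question. The frame construction and the helper name `first_minus_last` are assumptions for illustration; returning 0 for an empty week is one possible choice, not the only one.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "DATE": pd.to_datetime([
        "2016-04-26", "2016-04-27", "2016-04-28", "2016-04-29",
        "2016-04-30", "2016-05-01", "2016-05-02", "2016-05-03",
        "2016-05-08", "2016-05-11", "2016-05-12", "2016-05-13",
    ]),
    "ALFA": [1, 3, 0, 4, 3, -2, 3, 3, 1, 3, 10, 4],
    "BETA": [3, 0, 8, 2, 1, -5, 0, 3, 7, 1, 1, 2],
})

def first_minus_last(s):
    # Empty weekly bins have no first/last value; 0 is an assumed default.
    if s.empty:
        return 0
    return s.iloc[0] - s.iloc[-1]

# Weekly bins starting on Sunday, labelled by their left edge.
weekly = (df.resample("W", on="DATE", closed="left", label="left")
            .agg({"ALFA": "sum", "BETA": first_minus_last}))
print(weekly)
```

Swap the operands in `first_minus_last` if the difference should be end of week minus beginning of week.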

Calculating readmission rate

I am fairly new to Python and I am trying to calculate if a patient was readmitted to the hospital within 30 days or not.
The data is a pandas dataframe with columns for Patient Id, Arrival Date, Departure Date and Status (Discharged, Admitted, Did Not Wait). The question is similar to the past question "Calculate readmission rate", with the same requirements, but I need the code in Python.
I only need one readmission column (30 day readmission status). Any help in translating the code is appreciated. Thanks in advance.
@anky_91 Please correct me if I have misunderstood. Some random examples from my data: ex1, ex2, ex3
You can use the below:
df.groupby('Patient').apply(lambda x: (x['Admission Date'].shift(-1)
    - x['Discharge date']).dt.days.le(30).astype(int)).reset_index(drop=True)
Full code:
Considering the df looks like:
Visit Patient Admission Date Discharge date
0 1 1 2015-01-01 2015-01-02
1 2 2 2015-01-01 2015-01-01
2 3 3 2015-01-01 2015-01-02
3 4 1 2015-01-09 2015-01-09
4 5 2 2015-04-01 2015-04-05
5 6 1 2015-05-01 2015-05-01
df[['Admission Date','Discharge date']] = df[['Admission Date','Discharge date']].apply(pd.to_datetime)
df = df.sort_values(['Patient','Admission Date'])  # thanks @Jondiedoop
# Drop the Patient index level so the result aligns with df's original index
df['Readmit30'] = df.groupby('Patient').apply(lambda x: (x['Admission Date'].shift(-1)
    - x['Discharge date']).dt.days.le(30).astype(int)).reset_index(level=0, drop=True)
print(df)
Visit Patient Admission Date Discharge date Readmit30
0 1 1 2015-01-01 2015-01-02 1
3 4 1 2015-01-09 2015-01-09 0
5 6 1 2015-05-01 2015-05-01 0
1 2 2 2015-01-01 2015-01-01 0
4 5 2 2015-04-01 2015-04-05 0
2 3 3 2015-01-01 2015-01-02 0
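The same flag can also be computed without `apply`, which may be easier to read and to align with the original rows. The sketch below rebuilds the sample frame from this answer (the construction and the names `next_admission` and `gap_days` are assumptions for illustration) and uses `groupby(...).shift(-1)` to look up each patient's next admission.

```python
import pandas as pd

# Sample data copied from the answer above.
df = pd.DataFrame({
    "Visit": [1, 2, 3, 4, 5, 6],
    "Patient": [1, 2, 3, 1, 2, 1],
    "Admission Date": pd.to_datetime([
        "2015-01-01", "2015-01-01", "2015-01-01",
        "2015-01-09", "2015-04-01", "2015-05-01",
    ]),
    "Discharge date": pd.to_datetime([
        "2015-01-02", "2015-01-01", "2015-01-02",
        "2015-01-09", "2015-04-05", "2015-05-01",
    ]),
})

df = df.sort_values(["Patient", "Admission Date"])

# Next admission of the same patient; NaT for each patient's last visit.
next_admission = df.groupby("Patient")["Admission Date"].shift(-1)

# Days between this discharge and the next admission (NaN if there is none).
gap_days = (next_admission - df["Discharge date"]).dt.days

# NaN compares as False, so last visits are flagged 0.
df["Readmit30"] = gap_days.le(30).astype(int)
print(df)
```

Here the flag sits on the visit that is followed by a readmission, as in the answer above, not on the readmission visit itself.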
You can also try this one (I don't know why, but the answer above was giving me false readmission flags).
After sorting on visit_start_date:
visits_pandas_df.groupby('PatientId').apply(lambda x: (((x['visit_start_date'].shift(-1)-x['visit_end_date']).dt.days.shift(1).le(30)) ).astype(int)).values
Visits that are only one day apart are not counted as readmissions, so you will also need to handle that case in your logic.

Iterating and averaging pandas data frame

I have a database with a lot of rows such as:
timestamp name price profit
bob 5 4
jim 3 2
jim 2 6
bob 6 7
jim 4 1
jim 6 3
bob 3 1
The database is sorted by timestamp. I would like to add a new column that takes the last 2 values in the price column before the current value and averages them. The first three rows would then look something like this with the new column:
timestamp name price profit new column
bob 5 4 4.5
jim 3 2 3
jim 2 6 5
(6+3)/2 = 4.5
(2+4)/2 = 3
(4+6)/2 = 5
This isn't for a school project or anything this is just something I'm working on on my own. I've tried asking a similar question to this but I don't think I was very clear. Thanks in advance!
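def shift_n_roll(s):
    # mean of the next two prices within the group
    return s.shift(-1).rolling(2).mean().shift(-1)
# transform keeps the original index, so the result aligns with df
df['new column'] = df.groupby('name').price.transform(shift_n_roll)
df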
Judging by the result you want, I'm guessing you want the average of the two prices following the current one (for the same name) instead of the "2 values in the price column before the current value".
To be clear, I made up the timestamp values that you omitted.
print(df)
timestamp name price profit
0 2016-01-01 bob 5 4
1 2016-01-02 jim 3 2
2 2016-01-03 jim 2 6
3 2016-01-04 bob 6 7
4 2016-01-05 jim 4 1
5 2016-01-06 jim 6 3
6 2016-01-07 bob 3 1
#No need to sort if you already did.
#df.sort_values(['name','timestamp'], inplace=True)
df['new column'] = (df.groupby('name')['price'].shift(-1) + df.groupby('name')['price'].shift(-2)) / 2
print(df.dropna())
timestamp name price profit new column
0 2016-01-01 bob 5 4 4.5
1 2016-01-02 jim 3 2 3.0
2 2016-01-03 jim 2 6 5.0
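As a quick check, the sketch below rebuilds the sample frame (with the made-up timestamps from this answer) and computes the column two ways: with two grouped shifts and with a rolling mean inside `transform`. The variable names `prices`, `next_two` and `via_transform` are assumptions for illustration.

```python
import pandas as pd

# Sample data from the question, with the made-up timestamps used above.
df = pd.DataFrame({
    "timestamp": pd.date_range("2016-01-01", periods=7, freq="D"),
    "name": ["bob", "jim", "jim", "bob", "jim", "jim", "bob"],
    "price": [5, 3, 2, 6, 4, 6, 3],
    "profit": [4, 2, 6, 7, 1, 3, 1],
})

prices = df.groupby("name")["price"]

# Average of the next two prices for the same name, via two grouped shifts.
next_two = (prices.shift(-1) + prices.shift(-2)) / 2

# Same idea with a rolling mean; transform keeps the original index.
via_transform = prices.transform(
    lambda s: s.shift(-1).rolling(2).mean().shift(-1)
)

df["new column"] = next_two
print(df.dropna())
```

Rows without two later prices for the same name are left as NaN and disappear after `dropna()`.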
