Very simple pandas column/row transform that I cannot figure out - python

I need to do a simple calculation on values in a dataframe, but I need some columns transposed first. Once they are transposed, I want to take the most recent amount divided by the 2nd most recent amount, and then a binary flag for whether that ratio is less than or equal to .5.
By most recent I mean closest to the date in the Date 2 column.
Have This
| Name | Amount | Date 1 | Date 2 |
| -----| ---- |------------------------|------------|
| Jim | 100 | 2021-06-10 | 2021-06-15 |
| Jim | 200 | 2021-05-11 | 2021-06-15 |
| Jim | 150 | 2021-03-05 | 2021-06-15 |
| Bob | 350 | 2022-06-10 | 2022-08-30 |
| Bob | 300 | 2022-08-12 | 2022-08-30 |
| Bob | 400 | 2021-07-06 | 2022-08-30 |
I Want this
| Name | Amount | Date 2| Most Recent Amount(MRA) | 2nd Most Recent Amount(2MRA) | MRA / 2MRA| Less than or equal to .5 |
| -----| -------|------------------------|----------------|--------------------|-------------|--------------------------|
| Jim | 100 | 2021-06-15 | 100 | 200 | .5 | 1 |
| Bob | 300 | 2022-08-30 | 300 | 350 | .85 | 0 |

This is the original dataframe:
df = pd.DataFrame({'Name':['Jim','Jim','Jim','Bob','Bob','Bob'],
                   'Amount':[100,200,150,350,300,400],
                   'Date 1':['2021-06-10','2021-05-11','2021-03-05','2022-06-10','2022-08-12','2021-07-06'],
                   'Date 2':['2021-06-15','2021-06-15','2021-06-15','2022-08-30','2022-08-30','2022-08-30']})
And this is how to get the results:
# here we take the groupby of the 'Name' column
g = df.sort_values('Date 1', ascending=False).groupby(['Name'])
# then we use the agg function to get the first of 'Date 2' and 'Amount' columns
# and then rename result of the 'Amount' column to 'MRA'
first = g.agg({'Date 2':'first','Amount':'first'}).rename(columns={'Amount':'MRA'}).reset_index()
# Similarly, we take the second values by applying a lambda function
second = g.agg({'Date 2':'first','Amount':lambda t: t.iloc[1]}).rename(columns={'Amount':'2MRA'}).reset_index()
df_T = pd.merge(first, second, on=['Name','Date 2'], how='left')
# then we use this function to add two desired columns
def operator(x):
    return x['MRA']/x['2MRA'], 1 if x['MRA']/x['2MRA'] <= .5 else 0
# we apply the operator function to add 'MRA/2MRA' and 'Less than or equal to .5' columns
df_T['MRA/2MRA'], df_T['Less than or equal to .5'] = zip(*df_T.apply(operator, axis=1))
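As a side note, the same two columns can also be added without apply; a minimal equivalent sketch:
# vectorized version of the operator step above
df_T['MRA/2MRA'] = df_T['MRA'] / df_T['2MRA']
df_T['Less than or equal to .5'] = (df_T['MRA/2MRA'] <= .5).astype(int)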
Hope this helps. :)

One way to do what you've asked is:
# nlargest() below does not work on plain string dates, so convert both
# date columns to real datetimes first
df['Date 1'] = pd.to_datetime(df['Date 1'])
df['Date 2'] = pd.to_datetime(df['Date 2'])
df = ( df[df['Date 1'] <= df['Date 2']]
    .groupby('Name', sort=False)['Date 1'].nlargest(2)
    .reset_index(level=0)
    .assign(**{
        'Amount': df.Amount,
        'Date 2': df['Date 2'],
        'recency': ['MRA','MRA2']*len(set(df.Name.tolist()))
    })
    .pivot(index=['Name','Date 2'], columns='recency', values='Amount')
    .reset_index().rename_axis(columns=None) )
df = df.assign(**{'Amount':df.MRA, 'MRA / MRA2': df.MRA/df.MRA2})
df = df.assign(**{'Less than or equal to .5': (df['MRA / MRA2'] <= 0.5).astype(int)})
df = pd.concat([df[['Name', 'Amount']], df.drop(columns=['Name', 'Amount'])], axis=1)
Input:
Name Amount Date 1 Date 2
0 Jim 100 2021-06-10 2021-06-15
1 Jim 200 2021-05-11 2021-06-15
2 Jim 150 2021-03-05 2021-06-15
3 Bob 350 2022-06-10 2022-08-30
4 Bob 300 2022-08-12 2022-08-30
5 Bob 400 2021-07-06 2022-08-30
Output:
Name Amount Date 2 MRA MRA2 MRA / MRA2 Less than or equal to .5
0 Bob 300 2022-08-30 300 350 0.857143 0
1 Jim 100 2021-06-15 100 200 0.500000 1
Explanation:
- Filter only for rows where Date 1 <= Date 2.
- Use groupby() and nlargest() to get the 2 most recent Date 1 values per Name.
- Use assign() to add back the Amount and Date 2 columns and to create a recency column containing MRA and MRA2 for the pair of rows corresponding to each Name value.
- Use pivot() to turn the recency values MRA and MRA2 into column labels.
- Use reset_index() to restore Name and Date 2 to columns, and use rename_axis() to make the columns index anonymous.
- Use assign() once to restore Amount and add the column MRA / MRA2, and again to add the column named Less than or equal to .5.
- Use concat(), [] and drop() to rearrange the columns to match the output sequence shown in the question.

Here's the rough procedure you want:
- sort_values by Name and Date 1 to get the data in order.
- shift within each Name group to get the previous date and 2nd most recent amount fields.
- Filter the dataframe for Date 1 <= Date 2.
- groupby Name and use head to get only the first row per group.
Now your Amount column is your Most Recent Amount and your shifted Amount column is the 2nd Most Recent Amount. From there, you can do a simple division to get the ratio, as in the sketch below.
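A minimal sketch of that procedure, assuming the df from the question (the '2MRA' and ratio column names are just illustrative):
out = df.copy()
out['Date 1'] = pd.to_datetime(out['Date 1'])
out['Date 2'] = pd.to_datetime(out['Date 2'])
# newest Date 1 first within each Name, so head(1) keeps the most recent row
out = out.sort_values(['Name', 'Date 1'], ascending=[True, False])
# within each Name, the next row down holds the 2nd most recent amount
out['2MRA'] = out.groupby('Name')['Amount'].shift(-1)
out = out[out['Date 1'] <= out['Date 2']]
out = out.groupby('Name', sort=False).head(1).copy()
out['MRA / 2MRA'] = out['Amount'] / out['2MRA']
out['Less than or equal to .5'] = (out['MRA / 2MRA'] <= .5).astype(int)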

Related

Subsetting data with a column condition

I have a dataframe which contains Dates, Visitor_ID and Pages columns. The Pages column has different row-wise entries for each date. Please refer to the table below to understand the data.
| Dates     | Visitor_ID | Pages      |
|:----------|:----------:|-----------:|
| 10/1/2021 | 1          | xy         |
| 10/1/2021 | 1          | step2      |
| 10/1/2021 | 1          | xx         |
| 10/1/2021 | 1          | NetBanking |
| 10/1/2021 | 2          | step1      |
| 10/1/2021 | 2          | xy         |
| 10/1/2021 | 3          | step1      |
| 10/1/2021 | 3          | NetBanking |
| 11/1/2021 | 4          | step1      |
| 12/1/2021 | 4          | NetBanking |
Desired output:
| Dates     | Visitor_ID |
|-----------|------------|
| 10/1/2021 | 1          |
| 10/1/2021 | 3          |
The output should be a subset of the actual data, where the condition is: for the same Visitor_ID on the same date, if the Pages column contains the string "step" before the string "NetBanking", then return that Visitor_ID.
To initialise your dataframe you could do:
import pandas as pd

columns = ["Dates", "Visitor_ID", "Pages"]
records = [
    ["10/1/2021", 1, "xy"],
    ["10/1/2021", 1, "step2"],
    ["10/1/2021", 1, "NetBanking"],
    ["10/1/2021", 2, "step1"],
    ["10/1/2021", 2, "xy"],
    ["10/1/2021", 3, "step1"],
    ["10/1/2021", 3, "NetBanking"],
    ["11/1/2021", 4, "step1"],
    ["12/1/2021", 4, "NetBanking"],
]
data = pd.DataFrame.from_records(records, columns=columns)
data["Dates"] = pd.DatetimeIndex(data["Dates"])
index_names = columns[:2]
data.set_index(index_names, drop=True, inplace=True)
Note that I have left out your third line in the records, otherwise I cannot reproduce your desired output. I have made this a multi-index data frame in order to easily loop over the groups 'date/visitor'. The structure of the dataframe looks like:
print(data)
Pages
Dates Visitor_ID
2021-10-01 1 xy
1 step2
1 NetBanking
2 step1
2 xy
3 step1
3 NetBanking
2021-11-01 4 step1
2021-12-01 4 NetBanking
Now to select the customers from the same date and from the same group, I am going to loop over these groups and use 2 masks to select the required records:
for date_time, data_per_date in data.groupby(level=0):
    for visitor, data_per_visitor in data_per_date.groupby(level=0):
        # select the column with the Pages
        pages = data_per_visitor["Pages"].str
        # make 2 boolean masks, for the records with step and netbanking
        has_step = pages.contains("step")
        has_netbanking = pages.contains("NetBanking")
        # to get the records after each 'step' record, apply a diff on 'has_step'.
        # Convert to int first for the correct result.
        # Each diff with outcome -1 fulfils this requirement; make a
        # mask based on it.
        diff_step = has_step.astype(int).diff()
        records_after_step = diff_step == -1
        # combine the 2 masks to create your final mask and make a selection
        mask = records_after_step & has_netbanking
        # select the records and print to screen
        selection = data_per_visitor[mask]
        if not selection.empty:
            print(selection.reset_index()[index_names])
This gives the following output:
Dates Visitor_ID
0 2021-10-01 1
1 2021-10-01 3
EDIT:
I was reading your question again. The solution above assumes that only records with 'NetBanking' directly following a record with 'step' are valid. That is why I thought your example input did not correspond with your desired output. However, if you allow rows in between an occurrence of 'step' and the first 'NetBanking', the solution does not work. In that case, it is better to explicitly iterate over the rows of your dataframe per date and client id. An example would then be:
for date_time, data_per_date in data.groupby(level=0):
    for visitor, data_per_visitor in data_per_date.groupby(level=0):
        after_step = False
        index_selection = list()
        data_per_visitor.reset_index(inplace=True)
        for index, records in data_per_visitor.iterrows():
            page = records["Pages"]
            if "step" in page and not after_step:
                after_step = True
            if "NetBanking" in page and after_step:
                index_selection.append(index)
                after_step = False
        selection = data_per_visitor.reindex(index_selection)
        if not selection.empty:
            print(selection.reset_index()[index_names])
Normally I would not recommend using iterrows, as it is really slow, but in this case I don't see an easy alternative. The output of the second algorithm is the same as the first for my data. If you do include the third line from your example data, the second algorithm still gives the same output.
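If speed does become a concern, one possible vectorized sketch along the same lines (it uses a per-group cumulative maximum of the 'step' mask, allows rows in between, and has only been checked against the sample data above):
has_step = data["Pages"].str.contains("step")
has_netbanking = data["Pages"].str.contains("NetBanking")
# within each date/visitor group, cummax carries the 'step' flag forward,
# so a 'NetBanking' row is kept whenever any 'step' row precedes it
step_seen = has_step.astype(int).groupby(level=[0, 1]).cummax().astype(bool)
result = data[step_seen & has_netbanking].reset_index()[index_names].drop_duplicates()
print(result)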

Rename Columns Pandas

Hi, I have created a new data frame based on groupby, mean and count as per below:
suburb_price = HH.groupby(['Suburb']).agg({'Price':['mean'],
                                           'Suburb':['count']})
|   | Suburb        | Price   | Suburb |
|   |               | mean    | count  |
| 0 | Austins Ferry | 585,000 | 1      |
| 1 | Battery Point | 700,000 | 1      |
| 2 | Bellerive     | 498,571 | 7      |
| 3 | Berriedale    | 465,800 | 5      |
| 4 | Blackmans Bay | 625,000 | 1      |
and I want to change the name of the columns by using
suburb_price.reset_index(level=0,inplace=True)
suburb_price.rename(index={0:'Suburb Name',1:'Average Price',2:'Number of Properties'})
but it does not seem to work, not sure why
Your solution should work if you first rename the columns to a range with the length of the columns, and then pass the columns parameter to rename:
suburb_price.reset_index(inplace=True)
suburb_price.columns = range(len(suburb_price.columns))
suburb_price = suburb_price.rename(columns={0:'Suburb Name',1:'Average Price',2:'Number of Properties'})
Simpler is to set the column names from a list:
suburb_price.reset_index(inplace=True)
suburb_price.columns = ['Suburb Name','Average Price','Number of Properties']
Another idea is to use GroupBy.agg with named aggregations and rename the Suburb column:
HH = pd.DataFrame({'Suburb':list('aaabbc'), 'Price':[5,10,20,2,45,3]})
print (HH)
Suburb Price
0 a 5
1 a 10
2 a 20
3 b 2
4 b 45
5 c 3
suburb_price = (HH.groupby(['Suburb'])
                  .agg(**{'Average Price': ('Price', 'mean'),
                          'Number of Properties': ('Suburb', 'count')})
                  .reset_index()
                  .rename(columns={'Suburb':'Suburb Name'}))
print (suburb_price)
Suburb Name Average Price Number of Properties
0 a 11.666667 3
1 b 23.500000 2
2 c 3.000000 1

stack data based on column

I am working in Python and I have a data frame like the one below. It contains subject_id (referring to patient_id), hour_measure from 1 to 22, and other patient measurements.
subject_id | hour_measure | heart rate | urine color | blood pressure
-----------|--------------|------------|-------------|---------------
3          | 1            | 40         | red         | high
3          | 2            | 60         | red         | high
3          | ..           | ..         | ..          | ..
3          | 22           | 90         | red         | high
4          | 3            | 60         | yellow      | low
4          | 3            | 60         | yellow      | low
4          | 22           | 90         | red         | high
I want to group the subject_id measurements by max, min, skew, etc. for numeric features, and take the first and last value for categorical features.
I wrote the following code:
df= pd.read_csv(path)
df1 = (df.groupby(['subject_id','hour_measure'])
         .agg(['sum','min','max','median','var','skew']))
f = lambda x: next(iter(x.mode()), None)
cols = df.select_dtypes(object).columns
df2 = df.groupby(['subject_id','hour_measure'])[cols].agg(f)
df2.columns = pd.MultiIndex.from_product([df2.columns, ['mode']])
print (df2)
df3 = pd.concat([df1, df2], axis=1).unstack().reorder_levels([0,2,1],axis= 1)
print (df3)
df3.to_csv("newfile.csv")
It gives me the grouping for every hour.
I tried to make it group by subject_id only:
df1 = (df.groupby(['subject_id'])
         .agg(['sum','min','max','median','var','skew']))
It also gives me the same output and calculates the statistics for every hour, as follows:
subject_id | heart rate_1 | heartrate_2 ....
--------------------------------------------------------
| min max mean | min max mean ....
3
4
I want the output to be as follows:
subject_id | heart rate       | respiratory rate | urine color
-----------|------------------|------------------|----------------
           | min | max | mean | min | max | mean  | first | last
3          | 50  | 60  | 55   | 40  | 65  | 20    | yellow | red
Can anyone tell me how to edit the code to give the wanted output?
Any help will be appreciated.
Let me know if this gets you close to what you're looking for. I did not run into your issue with grouping by every hour, so I'm not sure if I understood your question completely.
# sample dataframe
df = pd.DataFrame(
    {
        "subject_id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
        "hour_measure": [1, 22, 12, 5, 18, 21, 8, 18, 4],
        "blood_pressure": ["high", "high", "high", "high", "low",
                           "low", "low", "low", "high"],
    }
)
# sort out numeric columns before aggregating them
numeric_result = (
    df.select_dtypes(include="number")
    .groupby(["subject_id"])
    .agg(["min", "max", "mean"])
)
# sort out categorical columns before aggregating them
categorical_result = (
    df.set_index(["subject_id"])
    .select_dtypes(include="object")
    .groupby(["subject_id"])
    .agg(["first", "last"])
)
# combine numeric and categorical results
result = numeric_result.join(categorical_result)
hour_measure blood_pressure
min max mean first last
subject_id
1 1 22 11.666667 high high
2 5 21 14.666667 high low
3 4 18 10.000000 low high
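If you then want to write this out with flat column names, as in the to_csv step of the original code, one option (just a sketch) is to join the two header levels first:
flat = result.copy()
# e.g. ('hour_measure', 'min') -> 'hour_measure_min'
flat.columns = ['_'.join(col) for col in flat.columns]
flat.to_csv("newfile.csv")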

Pandas - Sum Previous Rows if Value In Column Meets Condition

I have a dataframe that is of the following type. I have all the columns except the final column, "Total Previous Points P1", which I am hoping to create:
The data is sorted by the "Date" column.
Date | Points_P1 | P1_id | P2_id | Total_Previous_Points_P1
-------------+---------------+----------+-----------------------------------
10/08/15 | 5 | 100 | 90 | 500
-------------+---------------+----------+-----------------------------------
11/09/16 | 5 | 100 | 90 | 500
-------------+---------------+----------+-----------------------------------
20/09/19 | 10 | 10000 | 360 | 4,200
-------------+---------------+----------+-----------------------------------
... | | ... | ... | ...
-------------+---------------+----------+-----------------------------------
n | | | |
Now the column I want to create is the "Total_Previous_Points_P1" column shown above.
The way to create it:
- For each row, check the date (call this DATE_VAL) and the P1_id (call this ID_VAL).
- For all rows before DATE_VAL AND where P1_id == ID_VAL, sum up the points.
- Put this sum in the final column, in the current row.
Is there a fast pandas pythonic way to do this? My data set is very large.
Thank you!
The solution by SIA computes the sum of Points_P1 including the current value of Points_P1, whereas the requirement is to sum the previous points (for all rows before...).
Assuming that dates in each group are unique (in your sample they are), the proper, pandasonic solution should include the following steps:
- Sort by Date.
- Group by P1_id, then for each group:
  - Take the Points_P1 column.
  - Compute its cumulative sum.
  - Subtract the current value of Points_P1.
So the whole code should be:
df['Total_Previous_Points_P1'] = df.sort_values('Date')\
    .groupby(['P1_id']).Points_P1.cumsum() - df.Points_P1
Edit
If Date is not unique (within a group of rows with the same P1_id), the case is more complicated, which can be shown on such a source DataFrame:
Date Points_P1 P1_id
0 2016-11-09 5 100
1 2016-11-09 3 100
2 2015-10-08 5 100
3 2019-09-20 10 10000
4 2019-09-21 7 100
5 2019-07-10 12 10000
6 2019-12-10 12 10000
Note that for P1_id == 100 there are two rows for 2016-11-09.
In this case, start from computing "group" sums of previous points,
for each P1_id and Date:
sumPrev = df.groupby(['P1_id', 'Date']).Points_P1.sum()\
    .groupby(level=0).apply(lambda gr: gr.shift(fill_value=0).cumsum())\
    .rename('Total_Previous_Points_P1')
The result is:
P1_id Date
100 2015-10-08 0
2016-11-09 5
2019-09-21 13
10000 2019-07-10 0
2019-09-20 12
2019-12-10 22
Name: Total_Previous_Points_P1, dtype: int64
Then merge df with sumPrev on P1_id and Date (in sumPrev on the index):
df = pd.merge(df, sumPrev, left_on=['P1_id', 'Date'], right_index=True)
To show the result, it is more instructive to sort df also on ['P1_id', 'Date']:
Date Points_P1 P1_id Total_Previous_Points_P1
2 2015-10-08 5 100 0
0 2016-11-09 5 100 5
1 2016-11-09 3 100 5
4 2019-09-21 7 100 13
5 2019-07-10 12 10000 0
3 2019-09-20 10 10000 12
6 2019-12-10 12 10000 22
As you can see:
- The first sum for each P1_id is 0 (no points from previous dates).
- E.g. for both rows with Date == 2016-11-09 the sum of previous points is 5 (which is in the row for Date == 2015-10-08).
Try:
df['Total_Previous_Points_P1'] = df.groupby(['P1_id'])['Points_P1'].cumsum()
How It Works
First, it groups the data using the P1_id feature.
Then it accesses the Points_P1 values on the grouped dataframe and applies the cumulative sum function cumsum(), which returns the sum of points up to and including the current row for each group.

Calculate new column in pandas dataframe based only on grouped records

I have a dataframe with various events (id) and the following structure; the df is grouped by id and sorted on timestamp:
id | timestamp | A | B
1 | 02-05-2016|bla|bla
1 | 04-05-2016|bla|bla
1 | 05-05-2016|bla|bla
2 | 11-02-2015|bla|bla
2 | 14-02-2015|bla|bla
2 | 18-02-2015|bla|bla
2 | 31-03-2015|bla|bla
3 | 02-08-2016|bla|bla
3 | 07-08-2016|bla|bla
3 | 27-09-2016|bla|bla
Each timestamp-id combo indicates a different stage in the process of the event with that particular id. Each new record for a specific id indicates the start of a new stage for that event-id.
I would like to add a new column Duration that calculates the duration of each stage for each event (see the desired df below). This is easy, as I can simply calculate the difference between the timestamp of the next stage for the same event id and the timestamp of the current stage, as follows:
df['Start'] = pd.to_datetime(df['timestamp'])
df['End'] = pd.to_datetime(df['timestamp'].shift(-1))
df['Duration'] = df['End'] - df['Start']
My problem appears at the last stage of each event id, as I want to simply display NaNs or dashes, since the stage has not finished yet and the end time is unknown. My solution simply takes the timestamp of the next row, which is not always correct, as it might belong to a completely different event.
Desired output:
id | timestamp | A | B | Duration
1 | 02-05-2016|bla|bla| 2 days
1 | 04-05-2016|bla|bla| 1 days
1 | 05-05-2016|bla|bla| ------
2 | 11-02-2015|bla|bla| 3 days
2 | 14-02-2015|bla|bla| 4 days
2 | 18-02-2015|bla|bla| 41 days
2 | 31-03-2015|bla|bla| -------
3 | 02-08-2016|bla|bla| 5 days
3 | 07-08-2016|bla|bla| 50 days
3 | 27-09-2016|bla|bla| -------
I think this does what you want:
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['Duration'] = df.groupby('id')['timestamp'].diff().shift(-1)
If I understand correctly: groupby('id') tells pandas to apply .diff().shift(-1) to each group as if it were a miniature DataFrame independent of the other rows. I tested it on this fake data:
import pandas as pd
import numpy as np
# Generate some fake data
df = pd.DataFrame()
df['id'] = [1]*5 + [2]*3 + [3]*4
df['timestamp'] = pd.to_datetime('2017-01-1')
duration = sorted(np.random.randint(30,size=len(df)))
df['timestamp'] += pd.to_timedelta(duration, unit='D')  # treat the random integers as days (to_timedelta defaults to nanoseconds otherwise)
df['A'] = 'spam'
df['B'] = 'eggs'
but double-check just to be sure I didn't make a mistake!
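For a deterministic check of how the groupby/diff/shift combination behaves, here is a tiny hand-made example (hypothetical data, separate from the random frame above):
demo = pd.DataFrame({
    'id': [1, 1, 1, 2, 2],
    'timestamp': pd.to_datetime(['2016-05-02', '2016-05-04', '2016-05-05',
                                 '2015-02-11', '2015-02-14']),
})
# diff() gives the gap to the previous stage within each id; shift(-1) moves that
# gap up one row, so each row shows the time until its next stage. The last stage
# of each id ends up NaT, because the next group's first diff is NaT.
demo['Duration'] = demo.groupby('id')['timestamp'].diff().shift(-1)
print(demo)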
Here is one approach using apply
def timediff(row):
    row['timestamp'] = pd.to_datetime(row['timestamp'], format='%d-%m-%Y')
    return pd.DataFrame(row['timestamp'].diff().shift(-1))

res = df.assign(duration=df.groupby('id').apply(timediff))
Output:
id timestamp duration
0 1 02-05-2016 2 days
1 1 04-05-2016 1 days
2 1 05-05-2016 NaT
3 2 11-02-2015 3 days
4 2 14-02-2015 4 days
5 2 18-02-2015 41 days
6 2 31-03-2015 NaT
7 3 02-08-2016 5 days
8 3 07-08-2016 51 days
9 3 27-09-2016 NaT
