stack data based on column - python

I am working in Python and have a data frame like this. It contains subject_id (the patient ID), hour_measure (from 1 to 22), and the patient's other measurements:
subject_id | hour_measure | heart rate | urinecolor | blood pressure
------------------------------------------------------------------
3          | 1            | 40         | red        | high
3          | 2            | 60         | red        | high
3          | ..           | ..         | ..         | ..
3          | 22           | 90         | red        | high
4          | 3            | 60         | yellow     | low
4          | 3            | 60         | yellow     | low
4          | 22           | 90         | red        | high
I want to aggregate each subject_id's measurements: max, min, skew, etc. for the numeric features, and the first and last value for the categorical ones.
I wrote the following code:
import pandas as pd

df = pd.read_csv(path)
df1 = (df.groupby(['subject_id','hour_measure'])
         .agg(['sum','min','max','median','var','skew']))
f = lambda x: next(iter(x.mode()), None)
cols = df.select_dtypes(object).columns
df2 = df.groupby(['subject_id','hour_measure'])[cols].agg(f)
df2.columns = pd.MultiIndex.from_product([df2.columns, ['mode']])
print(df2)
df3 = pd.concat([df1, df2], axis=1).unstack().reorder_levels([0,2,1], axis=1)
print(df3)
df3.to_csv("newfile.csv")
It gives me the grouping for every hour.
I tried to make it group by subject_id only:
df1 = (df.groupby(['subject_id'])
         .agg(['sum','min','max','median','var','skew']))
It also gives me the same output, calculating the statistics for every hour, as follows:
subject_id | heart rate_1 | heartrate_2 ....
--------------------------------------------------------
| min max mean | min max mean ....
3
4
I want the output to look like this:
subject_id | heart rate       | respiratory rate | urine color
----------------------------------------------------------------
           | min | max | mean | min | max | mean | first  | last
3          | 50  | 60  | 55   | 40  | 65  | 20   | yellow | red
Can anyone tell me how to edit the code to get the desired output?
Any help will be appreciated.

Let me know if this gets you close to what you're looking for. I did not run into your issue with grouping by every hour, so I'm not sure I understood your question completely.
import pandas as pd

# sample dataframe
df = pd.DataFrame(
    {
        "subject_id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
        "hour_measure": [1, 22, 12, 5, 18, 21, 8, 18, 4],
        "blood_pressure": [
            "high",
            "high",
            "high",
            "high",
            "low",
            "low",
            "low",
            "low",
            "high",
        ],
    }
)
# sort out numeric columns before aggregating them
numeric_result = (
    df.select_dtypes(include="number")
    .groupby(["subject_id"])
    .agg(["min", "max", "mean"])
)
# sort out categorical columns before aggregating them
categorical_result = (
    df.set_index(["subject_id"])
    .select_dtypes(include="object")
    .groupby(["subject_id"])
    .agg(["first", "last"])
)
# combine numeric and categorical results
result = numeric_result.join(categorical_result)
           hour_measure                 blood_pressure
                    min max       mean           first  last
subject_id
1                     1  22  11.666667            high  high
2                     5  21  14.666667            high   low
3                     4  18  10.000000             low  high
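To tie this back to the columns in the question, here is a minimal sketch that builds the aggregation spec per column type, so numeric columns get statistics (including skew) and the remaining columns get first/last. The sample values are made up for illustration, and sorting by hour_measure first is an assumption about what "first" and "last" should mean.

```python
import pandas as pd

# Hypothetical sample shaped like the table in the question
df = pd.DataFrame({
    "subject_id": [3, 3, 3, 4, 4, 4],
    "hour_measure": [1, 2, 22, 3, 4, 22],
    "heart rate": [40, 60, 90, 60, 60, 90],
    "urinecolor": ["red", "red", "red", "yellow", "yellow", "red"],
    "blood pressure": ["high", "high", "high", "low", "low", "high"],
})

# Assumption: "first"/"last" should follow the hour order within each subject
df = df.sort_values(["subject_id", "hour_measure"])

# Columns to summarise: everything except the grouping key and the hour
value_cols = df.columns.drop(["subject_id", "hour_measure"])
numeric_cols = df[value_cols].select_dtypes(include="number").columns

# One list of aggregations per column, chosen by column type
spec = {
    col: ["min", "max", "mean", "skew"] if col in numeric_cols else ["first", "last"]
    for col in value_cols
}

# Group by subject only, so each subject ends up on a single row
result = df.groupby("subject_id").agg(spec)
print(result)
```

Because hour_measure is left out of both the groupby keys and the spec, the result has one row per subject_id rather than one block of statistics per hour.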

Related

Very simple pandas column/row transform that I cannot figure out

I need to do a simple calculation on values in a dataframe, but I need some columns transposed first. Once they are transposed, I want to divide the most recent amount by the 2nd most recent amount and then add a binary flag for whether the result is less than or equal to .5.
By most recent I mean closest to the date in the Date 2 column.
I have this:
| Name | Amount | Date 1 | Date 2 |
| -----| ---- |------------------------|------------|
| Jim | 100 | 2021-06-10 | 2021-06-15 |
| Jim | 200 | 2021-05-11 | 2021-06-15 |
| Jim | 150 | 2021-03-5 | 2021-06-15 |
| Bob | 350 | 2022-06-10 | 2022-08-30 |
| Bob | 300 | 2022-08-12 | 2022-08-30 |
| Bob | 400 | 2021-07-6 | 2022-08-30 |
I Want this
| Name | Amount | Date 2| Most Recent Amount(MRA) | 2nd Most Recent Amount(2MRA) | MRA / 2MRA| Less than or equal to .5 |
| -----| -------|------------------------|----------------|--------------------|-------------|--------------------------|
| Jim | 100 | 2021-06-15 | 100 | 200 | .5 | 1 |
| Bob | 300 | 2022-08-30 | 300 | 350 | .85 | 0 |
This is the original dataframe.
df = pd.DataFrame({'Name':['Jim','Jim','Jim','Bob','Bob','Bob'],
'Amount':[100,200,150,350,300,400],
'Date 1':['2021-06-10','2021-05-11','2021-03-05','2022-06-10','2022-08-12','2021-07-06'],
'Date 2':['2021-06-15','2021-06-15','2021-06-15','2022-08-30','2022-08-30','2022-08-30']
})
And this is the result.
# here we take the groupby of the 'Name' column
g = df.sort_values('Date 1', ascending=False).groupby(['Name'])
# then we use the agg function to get the first of 'Date 2' and 'Amount' columns
# and then rename result of the 'Amount' column to 'MRA'
first = g.agg({'Date 2':'first','Amount':'first'}).rename(columns={'Amount':'MRA'}).reset_index()
# Similarly, we take the second values by applying a lambda function
second = g.agg({'Date 2':'first','Amount':lambda t: t.iloc[1]}).rename(columns={'Amount':'2MRA'}).reset_index()
df_T = pd.merge(first, second, on=['Name','Date 2'], how='left')
# then we use this function to add two desired columns
def operator(x):
    ratio = x['MRA'] / x['2MRA']
    return ratio, int(ratio <= .5)
# we apply the operator function to add 'MRA/2MRA' and 'Less than or equal to .5' columns
df_T['MRA/2MRA'], df_T['Less than or equal to .5'] = zip(*df_T.apply(operator, axis=1))
Hope this helps. :)
One way to do what you've asked is:
# assumes 'Date 1' and 'Date 2' are already datetime64 (e.g. via pd.to_datetime);
# nlargest() does not work on string columns
df = ( df[df['Date 1'] <= df['Date 2']]
    .groupby('Name', sort=False)['Date 1'].nlargest(2)
    .reset_index(level=0)
    .assign(**{
        'Amount': df.Amount,
        'Date 2': df['Date 2'],
        'recency': ['MRA','MRA2']*len(set(df.Name.tolist()))
    })
    .pivot(index=['Name','Date 2'], columns='recency', values='Amount')
    .reset_index().rename_axis(columns=None) )
df = df.assign(**{'Amount':df.MRA, 'MRA / MRA2': df.MRA/df.MRA2})
df = df.assign(**{'Less than or equal to .5': (df['MRA / MRA2'] <= 0.5).astype(int)})
df = pd.concat([df[['Name', 'Amount']], df.drop(columns=['Name', 'Amount'])], axis=1)
Input:
Name Amount Date 1 Date 2
0 Jim 100 2021-06-10 2021-06-15
1 Jim 200 2021-05-11 2021-06-15
2 Jim 150 2021-03-05 2021-06-15
3 Bob 350 2022-06-10 2022-08-30
4 Bob 300 2022-08-12 2022-08-30
5 Bob 400 2021-07-06 2022-08-30
Output:
Name Amount Date 2 MRA MRA2 MRA / MRA2 Less than or equal to .5
0 Bob 300 2022-08-30 300 350 0.857143 0
1 Jim 100 2021-06-15 100 200 0.500000 1
Explanation:
Filter only for rows where Date 1 <= Date 2
Use groupby() and nlargest() to get the 2 most recent Date 1 values per Name
Use assign() to add back the Amount and Date 2 columns and create a recency column containing MRA and MRA2 for the pair of rows corresponding to each Name value
Use pivot() to turn the recency values MRA and MRA2 into column labels
Use reset_index() to restore Name and Date 2 to columns, and use rename_axis() to make the columns index anonymous
Use assign() once to restore Amount and add column MRA / MRA2, and again to add column named Less than or equal to .5
Use concat(), [] and drop() to rearrange the columns to match the output sequence shown in the question.
Here's the rough procedure you want:
sort_values by Name and Date 1 to get the data in order.
shift to get the previous date and 2nd most recent amount fields
Filter the dataframe for Date 1 <= Date 2.
groupby Name and use head(1) to keep only the first row of each group.
Now, your Amount column is your Most Recent Amount and your Shifted Amount column is the 2nd Most Recent amount. From there, you can do a simple division to get the ratio.
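The rough procedure above could be sketched as follows. This is only one reading of it: it assumes Date 1 is sorted newest-first within each Name (so that head(1) returns the most recent row) and that the shift is done per Name with shift(-1). The column names MRA and 2MRA are borrowed from the question.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Jim', 'Jim', 'Jim', 'Bob', 'Bob', 'Bob'],
    'Amount': [100, 200, 150, 350, 300, 400],
    'Date 1': ['2021-06-10', '2021-05-11', '2021-03-05',
               '2022-06-10', '2022-08-12', '2021-07-06'],
    'Date 2': ['2021-06-15'] * 3 + ['2022-08-30'] * 3,
})
df['Date 1'] = pd.to_datetime(df['Date 1'])
df['Date 2'] = pd.to_datetime(df['Date 2'])

# 1. Sort by Name, then newest Date 1 first within each Name
ordered = df.sort_values(['Name', 'Date 1'], ascending=[True, False])

# 2. Per Name, pull the next-older amount up onto the current row
ordered['2MRA'] = ordered.groupby('Name')['Amount'].shift(-1)

# 3. Keep only rows dated on or before Date 2, then 4. the first row per Name
result = (
    ordered[ordered['Date 1'] <= ordered['Date 2']]
    .groupby('Name')
    .head(1)
    .rename(columns={'Amount': 'MRA'})
    .reset_index(drop=True)
)

# 5. Ratio and binary flag
result['MRA / 2MRA'] = result['MRA'] / result['2MRA']
result['Less than or equal to .5'] = (result['MRA / 2MRA'] <= 0.5).astype(int)
print(result)
```

A Name with only one qualifying row ends up with NaN in 2MRA and in the ratio, and its flag is 0, which may or may not be the behaviour you want.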

Subsetting data with a column condition

I have a dataframe which contains Dates, Visitor_ID and Pages columns. The Pages column has a separate row for each page visited on each date. Please refer to the table below to understand the data.
| Dates     | Visitor_ID | Pages      |
|:--------- |:----------:| ----------:|
| 10/1/2021 | 1          | xy         |
| 10/1/2021 | 1          | step2      |
| 10/1/2021 | 1          | xx         |
| 10/1/2021 | 1          | NetBanking |
| 10/1/2021 | 2          | step1      |
| 10/1/2021 | 2          | xy         |
| 10/1/2021 | 3          | step1      |
| 10/1/2021 | 3          | NetBanking |
| 11/1/2021 | 4          | step1      |
| 12/1/2021 | 4          | NetBanking |
Desired output:
Date Visitor_ID
|10/1/2021 | 1 |
|10/1/2021 | 3 |
The output should be a subset of the actual data: if, for the same Visitor_ID, a page containing the string "step" appears before a page containing the string "NetBanking" on the same date, then return that Visitor_ID.
To initialise your dataframe you could do:
import pandas as pd

columns = ["Dates", "Visitor_ID", "Pages"]
records = [
    ["10/1/2021", 1, "xy"],
    ["10/1/2021", 1, "step2"],
    ["10/1/2021", 1, "NetBanking"],
    ["10/1/2021", 2, "step1"],
    ["10/1/2021", 2, "xy"],
    ["10/1/2021", 3, "step1"],
    ["10/1/2021", 3, "NetBanking"],
    ["11/1/2021", 4, "step1"],
    ["12/1/2021", 4, "NetBanking"]]
data = pd.DataFrame.from_records(records, columns=columns)
data["Dates"] = pd.DatetimeIndex(data["Dates"])
index_names = columns[:2]
data.set_index(index_names, drop=True, inplace=True)
Note that I have left out your third line in the records, otherwise I cannot reproduce your desired output. I have made this a multi-index data frame in order to easily loop over the groups 'date/visitor'. The structure of the dataframe looks like:
print(data)
Pages
Dates Visitor_ID
2021-10-01 1 xy
1 step2
1 NetBanking
2 step1
2 xy
3 step1
3 NetBanking
2021-11-01 4 step1
2021-12-01 4 NetBanking
Now to select the customers from the same date and from the same group, I am going to loop over these groups and use 2 masks to select the required records:
selections = []
for date_time, data_per_date in data.groupby(level="Dates"):
    # group by visitor within this date (level=0 here would group by date again)
    for visitor, data_per_visitor in data_per_date.groupby(level="Visitor_ID"):
        # select the column with the Pages
        pages = data_per_visitor["Pages"].str
        # make 2 boolean masks, for the records with step and netbanking
        has_step = pages.contains("step")
        has_netbanking = pages.contains("NetBanking")
        # to get the records after each 'step' records, apply a diff on 'has_step'
        # Convert to int first for the correct result
        # each diff with outcome -1 fulfills this requirement. Make a
        # mask based on this requirement
        diff_step = has_step.astype(int).diff()
        records_after_step = diff_step == -1
        # combine the 2 mask to create your final mask to make a selection
        mask = records_after_step & has_netbanking
        # select the records and collect them
        selection = data_per_visitor[mask]
        if not selection.empty:
            selections.append(selection.reset_index()[index_names])
# print all selected date/visitor pairs to screen
print(pd.concat(selections, ignore_index=True))
This gives the following output:
Dates Visitor_ID
0 2021-10-01 1
1 2021-10-01 3
EDIT:
I was reading your question again. The solution above assumes that only a 'NetBanking' record directly following a 'step' record is valid. That is why I thought your example input did not correspond with your desired output. However, if you allow rows in between an occurrence of 'step' and the first 'NetBanking', the solution does not work. In that case, it is better to explicitly iterate over the rows of your dataframe per date and visitor id. An example then would be:
selections = []
for date_time, data_per_date in data.groupby(level="Dates"):
    for visitor, data_per_visitor in data_per_date.groupby(level="Visitor_ID"):
        after_step = False
        index_selection = list()
        # work on a copy with a plain integer index
        data_per_visitor = data_per_visitor.reset_index()
        for index, records in data_per_visitor.iterrows():
            page = records["Pages"]
            if "step" in page and not after_step:
                after_step = True
            if "NetBanking" in page and after_step:
                index_selection.append(index)
                after_step = False
        selection = data_per_visitor.reindex(index_selection)
        if not selection.empty:
            selections.append(selection[index_names])
print(pd.concat(selections, ignore_index=True))
Normally I would not recommend using 'iterrows' as it is really slow, but in this case I don't see an easy alternative. The output of the second algorithm is the same as the first for my data. If you do include the third line from your example data, the second algorithm still gives the same output.
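For what it's worth, the "step somewhere before NetBanking, same date, same visitor" rule can also be sketched without an explicit row loop, using a per-group cumulative maximum. This is an alternative to the loops above, not a rewrite of them: the names flat, keys, seen_step and result are made up for the example, and it assumes the rows are already in visit order.

```python
import pandas as pd

records = [
    ["10/1/2021", 1, "xy"],
    ["10/1/2021", 1, "step2"],
    ["10/1/2021", 1, "xx"],
    ["10/1/2021", 1, "NetBanking"],
    ["10/1/2021", 2, "step1"],
    ["10/1/2021", 2, "xy"],
    ["10/1/2021", 3, "step1"],
    ["10/1/2021", 3, "NetBanking"],
    ["11/1/2021", 4, "step1"],
    ["12/1/2021", 4, "NetBanking"],
]
flat = pd.DataFrame(records, columns=["Dates", "Visitor_ID", "Pages"])
keys = ["Dates", "Visitor_ID"]

is_step = flat["Pages"].str.contains("step")
is_netbanking = flat["Pages"].str.contains("NetBanking")

# Within each (date, visitor) group, becomes True from the first "step" row onwards
seen_step = (
    is_step.astype(int)
    .groupby([flat["Dates"], flat["Visitor_ID"]])
    .cummax()
    .astype(bool)
)

# A NetBanking row counts only if a step row was already seen in the same group
result = (
    flat.loc[is_netbanking & seen_step, keys]
    .drop_duplicates()
    .reset_index(drop=True)
)
print(result)
```

This version keeps the "xx" row from the question's table and still selects visitors 1 and 3, since rows in between a step and NetBanking are allowed.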

Rename Columns Pandas

Hi, I have created a new data frame based on groupby, mean and count as below:
suburb_price = HH.groupby(['Suburb']).agg({'Price':['mean'],
                                           'Suburb':['count']})
  | Suburb        | Price   | Suburb
  |               | mean    | count
0 | Austins Ferry | 585,000 | 1
1 | Battery Point | 700,000 | 1
2 | Bellerive     | 498,571 | 7
3 | Berriedale    | 465,800 | 5
4 | Blackmans Bay | 625,000 | 1
and I want to change the names of the columns by using
suburb_price.reset_index(level=0,inplace=True)
suburb_price.rename(index={0:'Suburb Name',1:'Average Price',2:'Number of Properties'})
but it does not seem to work, and I'm not sure why.
Your solution should work if you first replace the column labels with a range the length of the columns, and then pass the mapping to the columns parameter of rename:
suburb_price.reset_index(inplace=True)
suburb_price.columns = range(len(suburb_price.columns))
suburb_price = suburb_price.rename(columns={0:'Suburb Name',1:'Average Price',2:'Number of Properties'})
A simpler way is to set the column names with a list:
suburb_price.reset_index(inplace=True)
suburb_price.columns = ['Suburb Name','Average Price','Number of Properties']
Another idea is to use GroupBy.agg with named aggregations and rename the Suburb column:
HH = pd.DataFrame({'Suburb':list('aaabbc'), 'Price':[5,10,20,2,45,3]})
print (HH)
Suburb Price
0 a 5
1 a 10
2 a 20
3 b 2
4 b 45
5 c 3
suburb_price = (HH.groupby(['Suburb'])
                  .agg(**{'Average Price': ('Price', 'mean'),
                          'Number of Properties': ('Suburb', 'count')})
                  .reset_index()
                  .rename(columns={'Suburb':'Suburb Name'}))
print (suburb_price)
Suburb Name Average Price Number of Properties
0 a 11.666667 3
1 b 23.500000 2
2 c 3.000000 1
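As for why the original attempt appeared to do nothing: rename(index=...) relabels rows rather than columns, it returns a new frame unless the result is assigned, and the dict-of-lists agg produces two-level column labels. The sketch below illustrates this on the toy HH data. To keep it simple it counts the Price column instead of Suburb, which is an assumption that differs slightly from the question's code.

```python
import pandas as pd

HH = pd.DataFrame({'Suburb': list('aaabbc'), 'Price': [5, 10, 20, 2, 45, 3]})

# A dict of lists gives two-level (MultiIndex) column labels
suburb_price = HH.groupby('Suburb').agg({'Price': ['mean', 'count']})
print(suburb_price.columns.tolist())

# rename(index=...) targets row labels and returns a new object,
# so the column labels of suburb_price are untouched by this call
ignored = suburb_price.rename(index={0: 'Suburb Name'})
print(suburb_price.columns.tolist())

# Flatten the two levels by assigning plain names, then name the index
# and move it into a regular column
flat = suburb_price.copy()
flat.columns = ['Average Price', 'Number of Properties']
flat = flat.rename_axis('Suburb Name').reset_index()
print(flat)
```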

Python pandas: dynamically change values in unspecified number of columns

I have a simple data frame which might look like this:
| Label | Average BR_1 | Average BR_2 | Average BR_3 | Average BR_4 |
| ------- | ------------ | ------------ | ------------ | ------------ |
| Label 1 | 50 | 30 | 50 | 50 |
| Label 2 | 60 | 20 | 50 | 50 |
| Label 3 | 65 | 50 | 50 | 50 |
What I would like to be able to do is to add a % symbol in every column.
I know that I can do something like this for every column:
df['Average BR_1'] = df['Average BR_1'].astype(str) + '%'
However, the problem is, that I read in the data from a CSV file which might contain more of these columns, so instead of Average BR_1 to Average BR_4, it might contain Average BR_1 to say Average BR_10.
So I would like this change to happen automatically for every column which contains Average BR_ in its column name.
I have been reading about .loc but I managed only to change column values to an entirely new value like so:
df.loc[:, ['Average BR_1', 'Average BR_2']] = "Hello"
Also, I haven't yet been able to implement regex here.
I tried with a list:
colsArr = [c for c in df.columns if 'Average BR_' in c]
print(colsArr)
But I did not manage to implement this with .loc.
I suppose I could do this using a loop, but I feel like there must be some better pandas solution, but I can not figure it out.
Could you help and point me in the right direction?
Thank you
# extract the column names that need to be updated
cols = df.columns[df.columns.str.startswith('Average BR')]
# update the columns
df[cols] = df[cols].astype(str).add('%')
print(df)
Label Average BR_1 Average BR_2 Average BR_3 Average BR_4
0 Label 1 50% 30% 50% 50%
1 Label 2 60% 20% 50% 50%
2 Label 3 65% 50% 50% 50%
working example
You can use df.update and df.filter
df.update(df.filter(like='Average BR_').astype('str').add('%'))
df
Out:
Label Average BR_1 Average BR_2 Average BR_3 Average BR_4
0 Label 1 50% 30% 50% 50%
1 Label 2 60% 20% 50% 50%
2 Label 3 65% 50% 50% 50%
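Since the question also asked about regex, here is a small sketch of the same idea using the regex parameter of df.filter to pick the columns. The pattern and the extra "Other" column are assumptions added to show that non-matching columns are left alone.

```python
import pandas as pd

df = pd.DataFrame({
    'Label': ['Label 1', 'Label 2', 'Label 3'],
    'Average BR_1': [50, 60, 65],
    'Average BR_2': [30, 20, 50],
    'Other': [1, 2, 3],  # hypothetical column that must stay numeric
})

# Select every column named "Average BR_" followed by digits, however many exist
cols = df.filter(regex=r'^Average BR_\d+$').columns

# Convert only those columns to strings and append the percent sign
df[cols] = df[cols].astype(str) + '%'
print(df)
```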

Calculate new column in pandas dataframe based only on grouped records

I have a dataframe with various events(id) and following structure, the df is grouped by id and sorted on timestamp :
id | timestamp | A | B
1 | 02-05-2016|bla|bla
1 | 04-05-2016|bla|bla
1 | 05-05-2016|bla|bla
2 | 11-02-2015|bla|bla
2 | 14-02-2015|bla|bla
2 | 18-02-2015|bla|bla
2 | 31-03-2015|bla|bla
3 | 02-08-2016|bla|bla
3 | 07-08-2016|bla|bla
3 | 27-09-2016|bla|bla
Each timestamp-id combo indicates a different stage in the process of the event with that particular id. Each new record for a specific id indicates the start of a new stage for that event-id.
I would like to add a new column Duration that calculates the duration of each stage for each event (see desired df below). This is easy, as I can simply calculate the difference between the timestamp of the next stage for the same event id and the timestamp of the current stage, as follows:
df['Start'] = pd.to_datetime(df['timestamp'])
df['End'] = pd.to_datetime(df['timestamp'].shift(-1))
df['Duration'] = df['End'] - df['Start']
My problem appears at the last stage of each event id, where I simply want to display NaNs or dashes, as the stage has not finished yet and the end time is unknown. My solution simply takes the timestamp of the next row, which is not always correct, as it might belong to a completely different event.
Desired output:
id | timestamp | A | B | Duration
1 | 02-05-2016|bla|bla| 2 days
1 | 04-05-2016|bla|bla| 1 days
1 | 05-05-2016|bla|bla| ------
2 | 11-02-2015|bla|bla| 3 days
2 | 14-02-2015|bla|bla| 4 days
2 | 18-02-2015|bla|bla| 41 days
2 | 31-03-2015|bla|bla| -------
3 | 02-08-2016|bla|bla| 5 days
3 | 07-08-2016|bla|bla| 50 days
3 | 27-09-2016|bla|bla| -------
I think this does what you want:
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['Duration'] = df.groupby('id')['timestamp'].diff().shift(-1)
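If I understand correctly: groupby('id') makes .diff() run within each id, so the first row of every group gets NaT. The following .shift(-1) is a plain Series shift that moves every difference up one row, which puts that NaT on the last row of the previous group. This relies on the rows of each id being contiguous and sorted by timestamp. I tested it on this fake data: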
import pandas as pd
import numpy as np
# Generate some fake data
df = pd.DataFrame()
df['id'] = [1]*5 + [2]*3 + [3]*4
df['timestamp'] = pd.to_datetime('2017-01-1')
duration = sorted(np.random.randint(30,size=len(df)))
df['timestamp'] += pd.to_timedelta(duration, unit='D')  # without a unit, integers are read as nanoseconds
df['A'] = 'spam'
df['B'] = 'eggs'
but double-check just to be sure I didn't make a mistake!
Here is one approach using apply:
def timediff(row):
    row['timestamp'] = pd.to_datetime(row['timestamp'], format='%d-%m-%Y')
    return pd.DataFrame(row['timestamp'].diff().shift(-1))
res = df.assign(duration=df.groupby('id').apply(timediff))
Output:
id timestamp duration
0 1 02-05-2016 2 days
1 1 04-05-2016 1 days
2 1 05-05-2016 NaT
3 2 11-02-2015 3 days
4 2 14-02-2015 4 days
5 2 18-02-2015 41 days
6 2 31-03-2015 NaT
7 3 02-08-2016 5 days
8 3 07-08-2016 51 days
9 3 27-09-2016 NaT
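A variant that stays closest to the Start/End code in the question is to shift the timestamp within each id, so the last stage of every event gets NaT as its end. This is a minimal sketch on the question's own dates. It assumes the timestamps are day-first (dd-mm-yyyy), which is what the desired durations imply, and that rows are already sorted by timestamp within each id.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
    "timestamp": ["02-05-2016", "04-05-2016", "05-05-2016",
                  "11-02-2015", "14-02-2015", "18-02-2015", "31-03-2015",
                  "02-08-2016", "07-08-2016", "27-09-2016"],
})

# Parse as day-month-year; the default parser would read these month-first
df["Start"] = pd.to_datetime(df["timestamp"], format="%d-%m-%Y")

# Next timestamp within the same id; NaT for the last stage of each id
df["End"] = df.groupby("id")["Start"].shift(-1)

df["Duration"] = df["End"] - df["Start"]
print(df[["id", "timestamp", "Duration"]])
```

If dashes are preferred over NaT for display, the Duration column can be converted to strings afterwards, at the cost of losing the timedelta dtype.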
