Rename Columns Pandas - python

Hi, I have created a new data frame based on groupby, mean and count, as per below:
suburb_price = HH.groupby(['Suburb']).agg({'Price': ['mean'],
                                           'Suburb': ['count']})
          Suburb    Price  Suburb
                     mean   count
0  Austins Ferry  585,000       1
1  Battery Point  700,000       1
2      Bellerive  498,571       7
3     Berriedale  465,800       5
4  Blackmans Bay  625,000       1
and I want to change the name of the columns by using
suburb_price.reset_index(level=0,inplace=True)
suburb_price.rename(index={0:'Suburb Name',1:'Average Price',2:'Number of Properties'})
but it does not seem to work, and I'm not sure why.

Your solution should work if you first rename the columns to a range matching the number of columns and then pass a dictionary to the columns parameter of rename. (Your original attempt failed because rename(index=...) renames row labels, not columns, and the result was never assigned back.)
suburb_price.reset_index(inplace=True)
suburb_price.columns = range(len(suburb_price.columns))
suburb_price = suburb_price.rename(columns={0:'Suburb Name',1:'Average Price',2:'Number of Properties'})
Simpler is to set the column names from a list:
suburb_price.reset_index(inplace=True)
suburb_price.columns = ['Suburb Name','Average Price','Number of Properties']
Another idea is to use GroupBy.agg with named aggregations and rename the Suburb column:
import pandas as pd

HH = pd.DataFrame({'Suburb': list('aaabbc'), 'Price': [5, 10, 20, 2, 45, 3]})
print (HH)
Suburb Price
0 a 5
1 a 10
2 a 20
3 b 2
4 b 45
5 c 3
suburb_price = (HH.groupby(['Suburb'])
                  .agg(**{'Average Price': ('Price', 'mean'),
                          'Number of Properties': ('Suburb', 'count')})
                  .reset_index()
                  .rename(columns={'Suburb': 'Suburb Name'}))
print (suburb_price)
Suburb Name Average Price Number of Properties
0 a 11.666667 3
1 b 23.500000 2
2 c 3.000000 1
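If you prefer to keep the original agg call with the dict of lists, another common pattern (a sketch, not part of the original answers) is to flatten the MultiIndex columns first and then rename:

import pandas as pd

# a hypothetical HH with the same shape as in the question
HH = pd.DataFrame({'Suburb': list('aaabbc'), 'Price': [5, 10, 20, 2, 45, 3]})

suburb_price = HH.groupby(['Suburb']).agg({'Price': ['mean'], 'Suburb': ['count']})
# the agg above produces MultiIndex columns like ('Price', 'mean');
# join the two levels into single strings, then rename to the desired labels
suburb_price.columns = ['_'.join(c) for c in suburb_price.columns]
suburb_price = (suburb_price.reset_index()
                            .rename(columns={'Suburb': 'Suburb Name',
                                             'Price_mean': 'Average Price',
                                             'Suburb_count': 'Number of Properties'}))
print(suburb_price)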

Related

Very simple pandas column/row transform that I cannot figure out

I need to do a simple calculation on values in a dataframe, but I need some columns transposed first. Once they are transposed, I want to take the most recent amount / 2nd most recent amount, and then a binary flag for whether the ratio is less than or equal to .5.
By "most recent" I mean most recent relative to the date in the Date 2 column.
Have This
| Name | Amount | Date 1 | Date 2 |
| -----| ---- |------------------------|------------|
| Jim | 100 | 2021-06-10 | 2021-06-15 |
| Jim | 200 | 2021-05-11 | 2021-06-15 |
| Jim  | 150    | 2021-03-05             | 2021-06-15 |
| Bob | 350 | 2022-06-10 | 2022-08-30 |
| Bob | 300 | 2022-08-12 | 2022-08-30 |
| Bob  | 400    | 2021-07-06             | 2022-08-30 |
I Want this
| Name | Amount | Date 2| Most Recent Amount(MRA) | 2nd Most Recent Amount(2MRA) | MRA / 2MRA| Less than or equal to .5 |
| -----| -------|------------------------|----------------|--------------------|-------------|--------------------------|
| Jim | 100 | 2021-06-15 | 100 | 200 | .5 | 1 |
| Bob | 300 | 2022-08-30 | 300 | 350 | .85 | 0 |
This is the original dataframe.
df = pd.DataFrame({'Name': ['Jim','Jim','Jim','Bob','Bob','Bob'],
                   'Amount': [100, 200, 150, 350, 300, 400],
                   'Date 1': ['2021-06-10','2021-05-11','2021-03-05','2022-06-10','2022-08-12','2021-07-06'],
                   'Date 2': ['2021-06-15','2021-06-15','2021-06-15','2022-08-30','2022-08-30','2022-08-30']})
And this is the result:
# here we take the groupby of the 'Name' column, sorted by most recent 'Date 1'
g = df.sort_values('Date 1', ascending=False).groupby(['Name'])

# then we use the agg function to get the first of the 'Date 2' and 'Amount' columns
# and rename the result of the 'Amount' column to 'MRA'
first = g.agg({'Date 2': 'first', 'Amount': 'first'}).rename(columns={'Amount': 'MRA'}).reset_index()

# similarly, we take the second values by applying a lambda function
second = g.agg({'Date 2': 'first', 'Amount': lambda t: t.iloc[1]}).rename(columns={'Amount': '2MRA'}).reset_index()

df_T = pd.merge(first, second, on=['Name', 'Date 2'], how='left')

# then we use this function to compute the two desired columns
def operator(x):
    ratio = x['MRA'] / x['2MRA']
    return ratio, 1 if ratio <= .5 else 0

# we apply the operator function to add the 'MRA/2MRA' and 'Less than or equal to .5' columns
df_T['MRA/2MRA'], df_T['Less than or equal to .5'] = zip(*df_T.apply(operator, axis=1))
Hope this helps. :)
One way to do what you've asked is:
df = (df[df['Date 1'] <= df['Date 2']]
      .groupby('Name', sort=False)['Date 1'].nlargest(2)
      .reset_index(level=0)
      .assign(**{
          'Amount': df.Amount,
          'Date 2': df['Date 2'],
          'recency': ['MRA', 'MRA2'] * len(set(df.Name.tolist()))
      })
      .pivot(index=['Name', 'Date 2'], columns='recency', values='Amount')
      .reset_index().rename_axis(columns=None))
df = df.assign(**{'Amount': df.MRA, 'MRA / MRA2': df.MRA / df.MRA2})
df = df.assign(**{'Less than or equal to .5': (df['MRA / MRA2'] <= 0.5).astype(int)})
df = pd.concat([df[['Name', 'Amount']], df.drop(columns=['Name', 'Amount'])], axis=1)
Input:
Name Amount Date 1 Date 2
0 Jim 100 2021-06-10 2021-06-15
1 Jim 200 2021-05-11 2021-06-15
2 Jim 150 2021-03-05 2021-06-15
3 Bob 350 2022-06-10 2022-08-30
4 Bob 300 2022-08-12 2022-08-30
5 Bob 400 2021-07-06 2022-08-30
Output:
Name Amount Date 2 MRA MRA2 MRA / MRA2 Less than or equal to .5
0 Bob 300 2022-08-30 300 350 0.857143 0
1 Jim 100 2021-06-15 100 200 0.500000 1
Explanation:
- Filter only for rows where Date 1 <= Date 2
- Use groupby() and nlargest() to get the 2 most recent Date 1 values per Name
- Use assign() to add back the Amount and Date 2 columns and create a recency column containing MRA and MRA2 for the pair of rows corresponding to each Name value
- Use pivot() to turn the recency values MRA and MRA2 into column labels
- Use reset_index() to restore Name and Date 2 to columns, and use rename_axis() to make the columns index anonymous
- Use assign() once to restore Amount and add column MRA / MRA2, and again to add the column named Less than or equal to .5
- Use concat(), [] and drop() to rearrange the columns to match the output sequence shown in the question.
Here's the rough procedure you want (a sketch in code follows the list):
1. sort_values by Name and Date 1 to get the data in order.
2. shift to get the previous date and 2nd most recent amount fields.
3. Filter the dataframe for Date 1 <= Date 2.
4. groupby Name and use head to get only the first row.
Now, your Amount column is your Most Recent Amount and your shifted Amount column is the 2nd Most Recent Amount. From there, a simple division gives the ratio.
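A minimal sketch of that procedure, using the sample data from the question:

import pandas as pd

df = pd.DataFrame({'Name': ['Jim','Jim','Jim','Bob','Bob','Bob'],
                   'Amount': [100, 200, 150, 350, 300, 400],
                   'Date 1': pd.to_datetime(['2021-06-10','2021-05-11','2021-03-05',
                                             '2022-06-10','2022-08-12','2021-07-06']),
                   'Date 2': pd.to_datetime(['2021-06-15']*3 + ['2022-08-30']*3)})

# 1. sort so the most recent 'Date 1' comes first within each Name
out = df.sort_values(['Name', 'Date 1'], ascending=[True, False])
# 2. shift(-1) within each group pulls the 2nd most recent amount onto the top row
out['2MRA'] = out.groupby('Name')['Amount'].shift(-1)
# 3. filter, then 4. keep only the first (most recent) row per Name
out = out[out['Date 1'] <= out['Date 2']].groupby('Name').head(1)
out['MRA / 2MRA'] = out['Amount'] / out['2MRA']
out['Less than or equal to .5'] = (out['MRA / 2MRA'] <= 0.5).astype(int)
print(out)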

How to apply multiple conditions and append to the same table in one dataframe in PySpark and SQL

I am trying to do a week-by-week reduction: if an ID and class_gp combo already appeared in a previous week, that combination should be removed from the current and all future weeks.
The data frame looks like this:
df1:
ID Week_ID class_gp school_nm
1 20200101 A 101
1 20200101 B 101
1 20200107 A 101
1 20200107 B 101
1 20200107 C 101
1 20200114 B 101
1 20200114 D 101
1 20200121 B 101
1 20200121 D 101
1 20200121 E 101
The ideal output should look like this:
ID Week_ID class_gp school_nm
1 20200101 A 101
1 20200101 B 101
1 20200107 C 101
1 20200114 D 101
1 20200121 E 101
I am not very good with for loops, so I used a brute-force approach: create a data frame for each week, then join them all.
# remove week 1's ID and class_gp combos from the remaining weeks
t1 = df1.where("week_id = '20200101'")
df2 = df1.join(t1,
               [df1.id == t1.id, df1.class_gp == t1.class_gp],
               how='left_anti')

# remove week 2's ID and class_gp combos from the remaining weeks
t2 = df2.where("week_id = '20200107'")
df3 = df2.join(t2,
               [df2.id == t2.id, df2.class_gp == t2.class_gp],
               how='left_anti')
...
and create all 18 weeks like that.
But creating that many data frames and running them like that is really, really slow.
Is there an easy way to create a single data frame that looks like the ideal output?
You could use a window function to achieve it:
val windowSpec=Window.partitionBy("class","school").orderBy("week")
and then apply a row_number function to the window and select the row with row=1 like below
scala> school.withColumn("row", row_number().over(windowSpec)).where("row=1").orderBy("week","class").drop("row").show(false)
+---+--------+-----+------+
|id |week |class|school|
+---+--------+-----+------+
|1 |20200101|A |101 |
|1 |20200101|B |101 |
|1 |20200107|C |101 |
|1 |20200114|D |101 |
|1 |20200121|E |101 |
+---+--------+-----+------+
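Since the question asks for PySpark, here is the same window-function idea sketched in PySpark; it assumes the column names from the question (ID, class_gp, Week_ID) and partitions by the ID/class_gp combo so only each combo's first week survives:

from pyspark.sql import Window
from pyspark.sql import functions as F

# earliest week first within each ID/class_gp combo
w = Window.partitionBy("ID", "class_gp").orderBy("Week_ID")

result = (df1.withColumn("row", F.row_number().over(w))
             .where("row = 1")
             .drop("row")
             .orderBy("Week_ID", "class_gp"))
result.show()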
You need just 2 lines of Python (pandas) code:
df = df.sort_values(by=['Week_ID'], ascending=True)
df = df.drop_duplicates(subset=['ID','class_gp'], keep='first')
This works because the sort puts the earliest Week_ID first, so keep='first' retains only each ID/class_gp combo's first appearance.

Stack data based on column

I am working in Python. I have a data frame like the one below, containing subject_id (referring to a patient ID), hour_measure from 1 to 22, and other patient measurements:
subject_id | hour_measure | heart rate | urine color | blood pressure
-----------|--------------|------------|-------------|---------------
3          | 1            | 40         | red         | high
3          | 2            | 60         | red         | high
3          | ..           | ..         | ..          | ..
3          | 22           | 90         | red         | high
4          | 3            | 60         | yellow      | low
4          | 3            | 60         | yellow      | low
4          | 22           | 90         | red         | high
I want to aggregate each subject_id's measurements with max, min, skew, etc. for the numeric features, and first and last value for the categorical ones.
I wrote the following code:
df = pd.read_csv(path)
df1 = (df.groupby(['subject_id','hour_measure'])
         .agg(['sum','min','max','median','var','skew']))
f = lambda x: next(iter(x.mode()), None)
cols = df.select_dtypes(object).columns
df2 = df.groupby(['subject_id','hour_measure'])[cols].agg(f)
df2.columns = pd.MultiIndex.from_product([df2.columns, ['mode']])
print (df2)
df3 = pd.concat([df1, df2], axis=1).unstack().reorder_levels([0,2,1],axis= 1)
print (df3)
df3.to_csv("newfile.csv")
This gives me the statistics grouped for every hour.
I tried to make it group by subject_id only:
df1 = (df.groupby(['subject_id'])
         .agg(['sum','min','max','median','var','skew']))
It also gives me the same output, calculating the statistics for every hour, as follows:
subject_id | heart rate_1     | heart rate_2     | ...
           | min | max | mean | min | max | mean | ...
3          |
4          |
I want the output to be as follows:
subject_id | heart rate       | respiratory rate | urine color
           | min | max | mean | min | max | mean | first | last
3          | 50  | 60  | 55   | 40  | 65  | 20   | yellow | red
Can anyone tell me how to edit the code to produce the wanted output?
Any help will be appreciated.
Let me know if this gets you close to what you're looking for. I did not run into your issue with grouping by every hour, so I'm not sure I understood your question completely.
import pandas as pd

# sample dataframe
df = pd.DataFrame(
    {
        "subject_id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
        "hour_measure": [1, 22, 12, 5, 18, 21, 8, 18, 4],
        "blood_pressure": ["high", "high", "high", "high", "low",
                           "low", "low", "low", "high"],
    }
)
# sort out numeric columns before aggregating them
numeric_result = (
    df.select_dtypes(include="number")
      .groupby(["subject_id"])
      .agg(["min", "max", "mean"])
)

# sort out categorical columns before aggregating them
categorical_result = (
    df.set_index(["subject_id"])
      .select_dtypes(include="object")
      .groupby(["subject_id"])
      .agg(["first", "last"])
)

# combine numeric and categorical results
result = numeric_result.join(categorical_result)
hour_measure blood_pressure
min max mean first last
subject_id
1 1 22 11.666667 high high
2 5 21 14.666667 high low
3 4 18 10.000000 low high
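If you want single-level column labels like in the desired output, one common follow-up (not part of the original answer) is to flatten the MultiIndex columns:

# join the two header levels, e.g. ('hour_measure', 'min') -> 'hour_measure_min'
result.columns = ['_'.join(col) for col in result.columns]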

Python - Groupby a DataFrameGroupBy object

I have a pandas dataframe in Python to which I am applying a groupby. Then I want to apply a new groupby + sum on the previous result. To be more specific, first I am doing:
check_df = data_df.groupby(['hotel_code', 'dp_id', 'market', 'number_of_rooms'])[['market', 'number_of_rooms']]
And then I want to do:
check_df = check_df.groupby(['market'])['number_of_rooms'].sum()
So, I am getting the following error:
AttributeError: Cannot access callable attribute 'groupby' of 'DataFrameGroupBy'
objects, try using the 'apply' method
My initial data look like that:
hotel_code | market | number_of_rooms | ....
---------------------------------------------
001 | a | 200 | ...
001 | a | 200 |
002 | a | 300 | ...
Notice that I may have duplicates of pairs like (a, 200); that's why I need the first groupby.
What I want in the end is something like that:
Market | Rooms
--------------
a | 3000
b | 250
I'm just trying to translate the following SQL query into Python:
select a.market, sum(a.number_of_rooms)
from (
select market, number_of_rooms
from opinmind_dev..cg_mm_booking_dataset_full
group by hotel_code, market, number_of_rooms
) as a
group by market ;
Any ideas how I can fix that? If you need any more info, let me know.
ps. I am new to Python and data science
IIUC, instead of:
check_df = data_df.groupby(['hotel_code', 'dp_id', 'market', 'number_of_rooms'])
[['market', 'number_of_rooms']]
You should simply do:
check_df = data_df.drop_duplicates(subset=['hotel_code', 'dp_id', 'market', 'number_of_rooms'])\
.loc[:, ['market', 'number_of_rooms']]\
.groupby('market')\
.sum()
import numpy as np
import pandas as pd

df = pd.DataFrame({'Market': [1,1,1,2,2,2,3,3], 'Rooms': range(8), 'C': np.random.rand(8)})
Market Rooms C
0 1 0 0.187793
1 1 1 0.325284
2 1 2 0.095147
3 2 3 0.296781
4 2 4 0.022262
5 2 5 0.201078
6 3 6 0.160082
7 3 7 0.683151
You need to move the column selection away from the grouped DataFrame. Either of the following should work.
df.groupby('Market').sum()[['Rooms']]
df[['Rooms']].groupby(df['Market']).sum()
Rooms
Market
1 3
2 12
3 13
If you select using ['Rooms'] instead of [['Rooms']] you will get a Series instead of a DataFrame.
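For example, a quick illustration of the difference using the sample df above:

s = df.groupby('Market')['Rooms'].sum()    # pandas Series
d = df.groupby('Market')[['Rooms']].sum()  # one-column DataFrame
print(type(s).__name__, type(d).__name__)  # Series DataFrame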
The dataframes produced use Market as their index. If you want to convert it back to a normal data column, use:
df.reset_index()
Market Rooms
0 1 3
1 2 12
2 3 13
If I understand your question correctly, you could simply do:
data_df.groupby('Market').agg({'Rooms': np.sum})
or:
data_df.groupby(['Market'], as_index=False).agg({'Rooms': np.sum})
data_df = pd.DataFrame({'Market': ['A','B','C','B'],
                        'Hotel': ['H1','H2','H4','H5'],
                        'Rooms': [20,40,50,34]})
data_df.groupby('Market').agg({'Rooms': np.sum})

Calculate new column in pandas dataframe based only on grouped records

I have a dataframe with various events (id) and the following structure; the df is grouped by id and sorted on timestamp:
id | timestamp | A | B
1 | 02-05-2016|bla|bla
1 | 04-05-2016|bla|bla
1 | 05-05-2016|bla|bla
2 | 11-02-2015|bla|bla
2 | 14-02-2015|bla|bla
2 | 18-02-2015|bla|bla
2 | 31-03-2015|bla|bla
3 | 02-08-2016|bla|bla
3 | 07-08-2016|bla|bla
3 | 27-09-2016|bla|bla
Each timestamp-id combo indicates a different stage in the process of the event with that particular id. Each new record for a specific id indicates the start of a new stage for that event-id.
I would like to add a new column Duration that calculates the duration of each stage for each event (see desired df below). This is easy, as I can simply calculate the difference between the timestamp of the next stage for the same event id and the timestamp of the current stage, as follows:
df['Start'] = pd.to_datetime(df['timestamp'])
df['End'] = pd.to_datetime(df['timestamp'].shift(-1))
df['Duration'] = df['End'] - df['Start']
My problem appears at the last stage of each event id: I want to display NaNs or dashes there, since the stage has not finished yet and the end time is unknown. My current solution simply takes the timestamp of the next row, which is not always correct, as it might belong to a completely different event.
Desired output:
id | timestamp | A | B | Duration
1 | 02-05-2016|bla|bla| 2 days
1 | 04-05-2016|bla|bla| 1 days
1 | 05-05-2016|bla|bla| ------
2 | 11-02-2015|bla|bla| 3 days
2 | 14-02-2015|bla|bla| 4 days
2 | 18-02-2015|bla|bla| 41 days
2 | 31-03-2015|bla|bla| -------
3 | 02-08-2016|bla|bla| 5 days
3 | 07-08-2016|bla|bla| 50 days
3 | 27-09-2016|bla|bla| -------
I think this does what you want:
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['Duration'] = df.groupby('id')['timestamp'].diff().shift(-1)
If I understand correctly: groupby('id') tells pandas to apply .diff() within each group, so the first row of each group gets NaT; the trailing .shift(-1) then runs over the whole column, moving each group's leading NaT up onto the previous group's last row, which is exactly the open-ended final stage. I tested it on this fake data:
import pandas as pd
import numpy as np
# Generate some fake data
df = pd.DataFrame()
df['id'] = [1]*5 + [2]*3 + [3]*4
df['timestamp'] = pd.to_datetime('2017-01-1')
duration = sorted(np.random.randint(30, size=len(df)))
df['timestamp'] += pd.to_timedelta(duration, unit='D')  # interpret offsets as days
df['A'] = 'spam'
df['B'] = 'eggs'
but double-check just to be sure I didn't make a mistake!
Here is one approach using apply:
def timediff(group):
    # each 'group' here is the sub-frame for a single id
    group['timestamp'] = pd.to_datetime(group['timestamp'], format='%d-%m-%Y')
    return pd.DataFrame(group['timestamp'].diff().shift(-1))

res = df.assign(duration=df.groupby('id').apply(timediff))
Output:
id timestamp duration
0 1 02-05-2016 2 days
1 1 04-05-2016 1 days
2 1 05-05-2016 NaT
3 2 11-02-2015 3 days
4 2 14-02-2015 4 days
5 2 18-02-2015 41 days
6 2 31-03-2015 NaT
7 3 02-08-2016 5 days
8 3 07-08-2016 51 days
9 3 27-09-2016 NaT
