How to apply groupby keys to the relevant groups - Python

I have a dataframe that I group by Nationality and age, like this:

 Name   Nationality  age
 Peter  UK           28
 John   US           29
 Wiley  UK           28
 Aster  US           29

grouped = self_ex_df.groupby(['Nationality', 'age'])
Now I want to attach a unique ID to each of these groups. I am trying this, though I am not sure it works:

uniqueID = 'ID_' + grouped.groups.keys().astype(str)

 uniqueID  Name   Nationality  age
 ID_UK28   Peter  UK           28
 ID_US29   John   US           29
 ID_UK28   Wiley  UK           28
 ID_US29   Aster  US           29
I now want to combine this into a new DataFrame, something like this:

 uniqueID  Nationality  age  Text
 ID_UK28   UK           28   Peter and Wiley have a combined age of 56
 ID_US29   US           29   John and Aster have a combined age of 58

How do I achieve the above?

Hopefully this is close enough (I couldn't get the average age):
import pandas as pd

# create dataframe
df = pd.DataFrame({'Name': ['Peter', 'John', 'Wiley', 'Aster'],
                   'Nationality': ['UK', 'US', 'UK', 'US'],
                   'age': [28, 29, 28, 29]})

# make uniqueID
df['uniqueID'] = 'ID_' + df['Nationality'] + df['age'].astype(str)

# groupby has an agg method that can take a dict and perform multiple aggregations
df = df.groupby(['uniqueID', 'Nationality']).agg({'age': 'sum', 'Name': lambda x: ' and '.join(x)})

# to get Text you just combine the new Name and the sum of age
df['Text'] = df['Name'] + ' have a combined age of ' + df['age'].astype(str)
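Printing df at this point should give something like:

                      age             Name                                       Text
uniqueID Nationality
ID_UK28  UK            56  Peter and Wiley  Peter and Wiley have a combined age of 56
ID_US29  US            58   John and Aster   John and Aster have a combined age of 58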

You don't need the groupby to create the uniqueID; you can group by that uniqueID later to get the groups based on age and nationality. I used a custom function to build the Text string. This is one way of doing it:
df1 = df.assign(uniqueID='ID_' + df.Nationality + df.age.astype(str))

def myText(x):
    # join the group's names, then append the group's summed age
    s = ' and '.join(x.Name)
    s += ' have a combined age of {}.'.format(x.age.sum())
    return s

df2 = (df1.groupby(['uniqueID', 'Nationality', 'age'])
          .apply(myText)
          .reset_index()
          .rename(columns={0: 'Text'}))
print(df2)
Output:

  uniqueID Nationality  age                                        Text
0  ID_UK28          UK   28  Peter and Wiley have a combined age of 56.
1  ID_US29          US   29   John and Aster have a combined age of 58.

Related

Window function equivalent with filter clause in Python pandas

I am trying to figure out a problem and am running a test example on a dummy dataset, built here:
import pandas as pd

data = [['tom', 30, 'sales', 5], ['nick', 35, 'sales', 8],
        ['juli', 24, 'marketing', 4], ['franz', 40, 'marketing', 6],
        ['jon', 50, 'marketing', 6], ['jeremie', 60, 'marketing', 6]]
df = pd.DataFrame(data, columns=['Name', 'Age', 'Department', 'Tenure'])
For each row, I want to find the mean age of everyone in the department who is at least as old as the person in that row. For example, Tom (30) in sales should return the mean of his age and Nick's, who is older, so 32.5 as the mean age; but for Nick it should return 35, since Tom in his department is younger than him. The code below achieves that, but I am looking for a quicker, more efficient way.
# Dynamically get mean, where age is greater than the line in question -
# almost definitely a better way of doing this though
def sumWindow(group):
    x = group['Age'].mean()
    group['Mean Dept Age'] = x
    return group

Name = []
Age = []
Department = []
Tenure = []
MeanDeptAge = []

for index, row in df.iterrows():
    n = row['Name']
    a = row['Age']
    df_temp = df[df['Age'] >= a]
    df_present = df_temp.groupby(df['Department']).apply(sumWindow)
    df_present['Relevant Name'] = n
    df_final = df_present[df_present['Name'] == df_present['Relevant Name']]
    Name.append(df_final.iloc[0, 0])
    Age.append(df_final.iloc[0, 1])
    Department.append(df_final.iloc[0, 2])
    Tenure.append(df_final.iloc[0, 3])
    MeanDeptAge.append(df_final.iloc[0, 4])
    del df_final

df_final = pd.DataFrame({'Name': Name,
                         'Age': Age,
                         'Department': Department,
                         'Tenure': Tenure,
                         'Mean Department Age - Greater Than Emp Age': MeanDeptAge})
df_final
Thanks! I have tried lots of different solutions, filtering within the groupby clause etc.
Use a grouped expanding.mean on the DataFrame sorted in descending order of Age:
df['Mean Department Age - Greater Than Emp Age'] = (df
.sort_values(by='Age', ascending=False)
.groupby('Department')['Age']
.expanding().mean()
.droplevel(0)
)
NB: this handles potential duplicated ages based on their order after sorting; you should define how you want to proceed if ties occur in your real use case (a tie-safe variant is sketched after the output below).
Output:

      Name  Age Department  Tenure  Mean Department Age - Greater Than Emp Age
0      tom   30      sales       5                                        32.5
1     nick   35      sales       8                                        35.0
2     juli   24  marketing       4                                        43.5
3    franz   40  marketing       6                                        50.0
4      jon   50  marketing       6                                        55.0
5  jeremie   60  marketing       6                                        60.0
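If ties should share the same value, a slower but tie-safe sketch (assuming "older" here means Age greater than or equal to the row's own age, as in the original loop) is to compare every age in the department explicitly:

df['Mean Department Age - Greater Than Emp Age'] = (
    df.groupby('Department')['Age']
      # for each row, average all department ages >= that row's age
      .apply(lambda g: g.apply(lambda a: g[g >= a].mean()))
      .droplevel(0)
)

This is quadratic per department, so it only makes sense when groups are small or exact tie handling matters.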
def function1(dd: pd.DataFrame):
    dd1 = dd.sort_values("Age", ascending=False).Age.expanding().mean()
    return dd1.rename("Mean Department Age - Greater Than Emp Age")

df.join(df.groupby('Department').apply(function1).droplevel(0))

Output:
      Name  Age Department  Tenure  Mean Department Age - Greater Than Emp Age
0      tom   30      sales       5                                        32.5
1     nick   35      sales       8                                        35.0
2     juli   24  marketing       4                                        43.5
3    franz   40  marketing       6                                        50.0
4      jon   50  marketing       6                                        55.0
5  jeremie   60  marketing       6                                        60.0

Fill pandas dataframe with dictionary elements

I have a dataframe df structured as follows:

 Name  Surname  Nationality
 Joe   Tippy    Italian
 Adam  Wesker   American
I would like to create a new record based on a dictionary whose keys corresponds to the column names:
new_record = {'Name': 'Jimmy', 'Surname': 'Turner', 'Nationality': 'Australian'}
How can I do that? I tried with a simple:

df = df.append(new_record, ignore_index=True)

but if my record has a missing value, the dataframe doesn't get filled with an empty string; instead that column is left empty.
IIUC, replace the missing values in a follow-up step:
new_record = {'Surname': 'Turner', 'Nationality': 'Australian'}
df = pd.concat([df, pd.DataFrame([new_record])], ignore_index=True).fillna('')
print (df)
   Name Surname Nationality
0   Joe   Tippy     Italian
1  Adam  Wesker    American
2         Turner  Australian
Or use DataFrame.reindex on the new row only, which avoids blanking out genuine NaNs already present in df:

df = pd.concat([df, pd.DataFrame([new_record]).reindex(df.columns, fill_value='', axis=1)], ignore_index=True)
A simple way if you have a range index:
df.loc[len(df)] = new_record
Updated dataframe:
    Name Surname Nationality
0    Joe   Tippy     Italian
1   Adam  Wesker    American
2  Jimmy  Turner  Australian
If you have a missing key (for example 'Surname'):
    Name Surname Nationality
0    Joe   Tippy     Italian
1   Adam  Wesker    American
2  Jimmy     NaN  Australian
If you want empty strings:
df.loc[len(df)] = pd.Series(new_record).reindex(df.columns, fill_value='')
Output:
    Name Surname Nationality
0    Joe   Tippy     Italian
1   Adam  Wesker    American
2  Jimmy          Australian

Faster way to query & compute in Pandas [duplicate]

This question already has answers here: Pandas Merging 101 (8 answers). Closed 2 years ago.
I have two dataframes in Pandas. What I want to achieve is: grab every 'Name' from DF1 and get the corresponding 'City' and 'State' from DF2. For example, 'Dwight' from DF1 should return the corresponding values 'Miami' and 'Florida' from DF2.
DF1 has approx 70,000 rows with 3 columns:

          Name  Age Student
0       Dwight   20     Yes
1      Michael   30      No
2          Pam   55      No
...
70000      Jim   27     Yes
The second dataframe, DF2, has approx 320,000 rows:

           Name      City         State
0        Dwight     Miami       Florida
1       Michael  Scranton  Pennsylvania
2           Pam    Austin         Texas
...
325082      Jim  Scranton  Pennsylvania
Currently I have two functions that return the values of 'City' and 'State' using a filter:
def read_city(id):
    filt = (df2['Name'] == id)
    if filt.any():
        field = df2[filt]['City'].values[0]
    else:
        field = ""
    return field

def read_state(id):
    filt = (df2['Name'] == id)
    if filt.any():
        field = df2[filt]['State'].values[0]
    else:
        field = ""
    return field
I am using the apply function to process all the values.
df['city_list'] = df['Name'].apply(read_city)
df['State_list'] = df['Name'].apply(read_state)
Computed this way, the result takes a long time: roughly 18 minutes to get back df['city_list'] and df['State_list']. Is there a faster way to compute this? Since I am completely new to pandas, I would like to know if there is an efficient way.
I believe you can do a map:

s = df2.groupby('Name')[['City', 'State']].agg(list)
df['city_list'] = df['Name'].map(s['City'])
df['State_list'] = df['Name'].map(s['State'])
Or a left merge after you got s:
df = df.merge(s.add_suffix('_list'), left_on='Name', right_index=True, how='left')
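If each Name can appear only once in df2 (an assumption; the agg(list) above exists to handle duplicates), a plain index lookup is enough:

# sketch, assuming 'Name' is unique in df2
lookup = df2.set_index('Name')
df['city_list'] = df['Name'].map(lookup['City'])
df['State_list'] = df['Name'].map(lookup['State'])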
I think you can do something like this:
import pandas as pd

# Dataframe DF1 (dummy data)
DF1 = pd.DataFrame(columns=['Name', 'Age', 'Student'],
                   data=[['Dwight', 20, 'Yes'], ['Michael', 30, 'No'],
                         ['Pam', 55, 'No'], ['Jim', 27, 'Yes']])
print("DataFrame DF1")
print(DF1)

# Dataframe DF2 (dummy data)
DF2 = pd.DataFrame(columns=['Name', 'City', 'State'],
                   data=[['Dwight', 'Miami', 'Florida'], ['Michael', 'Scranton', 'Pennsylvania'],
                         ['Pam', 'Austin', 'Texas'], ['Jim', 'Scranton', 'Pennsylvania']])
print("DataFrame DF2")
print(DF2)

# Merge on the 'Name' column, then rename the 'City' and 'State' columns
df = pd.merge(DF1, DF2, on=['Name']).rename(columns={'City': 'city_list', 'State': 'State_list'})
print("DataFrame final")
print(df)
Output:

DataFrame DF1
      Name  Age Student
0   Dwight   20     Yes
1  Michael   30      No
2      Pam   55      No
3      Jim   27     Yes

DataFrame DF2
      Name      City         State
0   Dwight     Miami       Florida
1  Michael  Scranton  Pennsylvania
2      Pam    Austin         Texas
3      Jim  Scranton  Pennsylvania

DataFrame final
      Name  Age Student city_list    State_list
0   Dwight   20     Yes     Miami       Florida
1  Michael   30      No  Scranton  Pennsylvania
2      Pam   55      No    Austin         Texas
3      Jim   27     Yes  Scranton  Pennsylvania
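One design note: pd.merge defaults to an inner join, so any name in DF1 that is missing from DF2 would be dropped. To keep such rows and mirror the empty-string fallback of the original functions, something like this should work:

# sketch: a left join keeps unmatched names, with '' as the fallback
df = pd.merge(DF1, DF2, on=['Name'], how='left').fillna('')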

Converting python list of delimited items in to a pandas data frame

I have a list like this, where items are separated by ':':

x = ['john:42:engineer',
     'michael:29:doctor']

Is there a way to change this into a data frame like the one below, by defining the columns Name, Age and Occupation?

      Name  Age Occupation
0     john   42   engineer
1  michael   29     doctor
You can just use split:
pd.DataFrame([y.split(':') for y in x], columns = ['Name','Age', 'Occupation'])
Output:

      Name Age Occupation
0     john  42   engineer
1  michael  29     doctor
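Note that split leaves every column as strings, so Age will have object dtype; if you want a numeric Age (an optional extra step), convert it afterwards:

df = pd.DataFrame([y.split(':') for y in x], columns=['Name', 'Age', 'Occupation'])
df['Age'] = df['Age'].astype(int)  # parse the Age strings into integers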
I will do:

df = pd.Series(x).str.split(':', expand=True)
df.columns = ['Name', 'Age', 'Occupation']
df

Out[172]:
      Name Age Occupation
0     john  42   engineer
1  michael  29     doctor
Not sure this is the best approach, but...

x = ['john:42:engineer', 'michael:29:doctor']
x = [i.split(':') for i in x]
pd.DataFrame({'name': [i[0] for i in x],
              'age': [i[1] for i in x],
              'occupation': [i[2] for i in x]})
Output:

      name age occupation
0     john  42   engineer
1  michael  29     doctor

Groupby one column and count another column with a condition?

I was wondering if it is possible to group by one column while counting the values in another column that fulfil a condition. Because my dataset is a bit weird, I created a similar one:
import pandas as pd

raw_data = {'name': ['John', 'Paul', 'George', 'Emily', 'Jamie'],
            'nationality': ['USA', 'USA', 'France', 'France', 'UK'],
            'books': [0, 15, 0, 14, 40]}
df = pd.DataFrame(raw_data, columns=['name', 'nationality', 'books'])
Say I want to group by nationality and count the number of people from each country who don't have any books (books == 0). I would therefore expect something like the following as output:

nationality
USA       1
France    1
UK        0
I tried most variations of groupby, using filter, agg, etc., but don't seem to get anything that works. Thanks in advance, BBQuercus :)
IIUC:
df.books.eq(0).astype(int).groupby(df.nationality).sum()

nationality
France    1
UK        0
USA       1
Name: books, dtype: int64
Use (note: x.eq(0).any() would only flag whether a zero exists; to actually count the zero-book people per country, sum the boolean mask):

df.groupby('nationality')['books'].apply(lambda x: x.eq(0).sum())

nationality
France    1
UK        0
USA       1
Name: books, dtype: int64
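Another option, if you prefer a labelled result column (a sketch using pandas named aggregation, available since 0.25):

# sketch: named aggregation gives the count an explicit column label
df.groupby('nationality').agg(no_books=('books', lambda s: s.eq(0).sum()))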
