I have a dataset containing the following columns:
['sex', 'age', 'relationship_status']
There are some NaN values in the 'relationship_status' column, and I want to replace them with the most common value in each group based on age and gender.
I know how to groupby and count the values:
df2.groupby(['age','sex'])['relationship_status'].value_counts()
and it returns:
age sex relationship_status
17.0 female Married with kids 1
18.0 female In relationship 5
Married 4
Single 4
Married with kids 2
male In relationship 9
Single 5
Married 4
Married with kids 4
Divorced 3
.
.
.
86.0 female In relationship 1
92.0 male Married 1
97.0 male In relationship 1
So again, whenever "relationship_status" is empty, I need the program to replace it with the most frequent value for that person's age and gender.
Can anyone suggest how I can do it?
Kind regards.
Something like this:
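# transform keeps the original row index, so the result aligns with df2 in fillna (assumes numpy is imported as np)
mode = df2.groupby(['age','sex'])['relationship_status'].transform(lambda x: x.mode().iloc[0] if not x.mode().empty else np.nan)
df2['relationship_status'] = df2['relationship_status'].fillna(mode)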
Try this. It fills with 'ALL_NAN' when an (age, sex) subgroup contains only NaNs:
import numpy as np
import pandas as pd
df = pd.DataFrame(
{'age': [25, 25, 25, 25, 25, 25,],
'sex': ['F', 'F', 'F', 'M', 'M', 'M', ],
'status': ['married', np.nan, 'married', np.nan, np.nan, 'single']
})
df.loc[df['status'].isna(), 'status'] = df.groupby(['age','sex'])['status'].transform(lambda x: x.mode()[0] if not x.mode().empty else 'ALL_NAN')
Output:
age sex status
0 25 F married
1 25 F married
2 25 F married
3 25 M single
4 25 M single
5 25 M single
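As a variation on the answers above, here is a minimal sketch that falls back to the overall mode when a whole (age, sex) group is NaN, instead of writing a placeholder string. The helper name `group_mode` and the extra age-30 rows are assumptions added for illustration; they are not part of the original data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, 25, 25, 25, 25, 25, 30, 30],
    "sex": ["F", "F", "F", "M", "M", "M", "F", "F"],
    "status": ["married", np.nan, "married", np.nan, np.nan, "single", np.nan, np.nan],
})


def group_mode(s: pd.Series):
    """Most frequent non-null value of s, or NaN when s holds only NaN."""
    m = s.mode()  # mode() skips NaN by default; ties come back sorted
    return m.iloc[0] if not m.empty else np.nan


# transform broadcasts each group's mode back onto the original row index
per_group = df.groupby(["age", "sex"])["status"].transform(group_mode)

# Fallback for groups with no observed value at all (here: age 30, F)
overall = df["status"].mode().iloc[0]

filled = df["status"].fillna(per_group).fillna(overall)
print(filled.tolist())
```

Whether the overall mode is a sensible fallback depends on the data; leaving those rows as NaN is an equally valid choice.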
I am trying to figure out a problem and am running a test example on a dummy dataset, built here:
import pandas as pd
data = [['tom', 30, 'sales', 5], ['nick', 35, 'sales', 8], ['juli', 24, 'marketing', 4], ['franz', 40, 'marketing', 6], ['jon', 50, 'marketing', 6], ['jeremie', 60, 'marketing', 6]]
df = pd.DataFrame(data, columns=['Name', 'Age', 'Department', 'Tenure'])
For each row, I want the mean age of everyone in the same department who is at least as old as the person in that row (that person included). For example, Tom (30) in sales should return the mean of his age and that of Nick, who is older, so 32.5. Nick should return 35, because Tom, the only other person in his department, is younger than him. The code below achieves that, but I am looking for a quicker, more efficient way.
#Dynamically get mean, where age is greater than the line in question - almost definitely a better
#way of doing this though
def sumWindow(group):
    x = group['Age'].mean()
    group['Mean Dept Age'] = x
    return group
Name = []
Age = []
Department = []
Tenure = []
MeanDeptAge = []
for index, row in df.iterrows():
    n = row['Name']
    a = row['Age']
    df_temp = df[df['Age'] >= a]
    df_present = df_temp.groupby(df['Department']).apply(sumWindow)
    df_present['Relevant Name'] = n
    df_final = df_present[df_present['Name'] == df_present['Relevant Name']]
    Name.append(df_final.iloc[0,0])
    Age.append(df_final.iloc[0,1])
    Department.append(df_final.iloc[0,2])
    Tenure.append(df_final.iloc[0,3])
    MeanDeptAge.append(df_final.iloc[0,4])
    del df_final
df_final = pd.DataFrame({'Name': Name,
                         'Age': Age,
                         'Department': Department,
                         'Tenure': Tenure,
                         'Mean Department Age - Greater Than Emp Age': MeanDeptAge,
                         })
df_final
Thanks!
I have tried lots of different solutions, filtering within the groupby clause, etc.
Use a grouped expanding.mean on the DataFrame sorted in descending order of Age:
df['Mean Department Age - Greater Than Emp Age'] = (df
.sort_values(by='Age', ascending=False)
.groupby('Department')['Age']
.expanding().mean()
.droplevel(0)
)
NB. This handles duplicated ages according to their row order after sorting; you should define how you want to proceed if duplicates occur in your real use case.
Output:
Name Age Department Tenure Mean Department Age - Greater Than Emp Age
0 tom 30 sales 5 32.5
1 nick 35 sales 8 35.0
2 juli 24 marketing 4 43.5
3 franz 40 marketing 6 50.0
4 jon 50 marketing 6 55.0
5 jeremie 60 marketing 6 60.0
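If tied ages should all receive the same value, one option is to compare every age with every other age inside each department. The sketch below is an assumption about how ties might be handled, not part of the answer above; the extra row for "ann", the helper name `mean_age_at_least` and the column name `MeanAgeGE` are made up for illustration. It is quadratic in the group size, so it only suits moderately sized departments.

```python
import pandas as pd

data = [['tom', 30, 'sales', 5], ['nick', 35, 'sales', 8],
        ['juli', 24, 'marketing', 4], ['franz', 40, 'marketing', 6],
        ['jon', 50, 'marketing', 6], ['jeremie', 60, 'marketing', 6],
        ['ann', 30, 'sales', 2]]  # hypothetical row that ties with tom
df = pd.DataFrame(data, columns=['Name', 'Age', 'Department', 'Tenure'])


def mean_age_at_least(ages: pd.Series) -> pd.Series:
    """For each person, the mean age of colleagues who are at least as old."""
    a = ages.to_numpy()
    # mask[i, j] is True when colleague j is at least as old as person i
    mask = a[None, :] >= a[:, None]
    return pd.Series((mask * a).sum(axis=1) / mask.sum(axis=1), index=ages.index)


df['MeanAgeGE'] = df.groupby('Department')['Age'].transform(mean_age_at_least)
print(df)
```

With this definition tom and ann, both 30, get the same mean over all three sales ages, while the marketing values match the output shown above.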
def function1(dd:pd.DataFrame):
    dd1=dd.sort_values("Age",ascending=False).Age.expanding().mean()
    return dd1.rename("Mean Department Age - Greater Than Emp Age")
df.join(df.groupby('Department').apply(function1).droplevel(0))
Output:
Name Age Department Tenure Mean Department Age - Greater Than Emp Age
0 tom 30 sales 5 32.5
1 nick 35 sales 8 35.0
2 juli 24 marketing 4 43.5
3 franz 40 marketing 6 50.0
4 jon 50 marketing 6 55.0
5 jeremie 60 marketing 6 60.0
I'm a beginner Python coder, and I want to build a Python function that calculates a specific indicator.
For example, the data looks like:
ID status Age Gender
01 healthy 16 Male
02 un_healthy 14 Female
03 un_healthy 22 Male
04 healthy 12 Female
05 healthy 33 Female
I want to build a function that calculates the percentage of healthy people out of healthy + un_healthy:
def health_rate(healthy, un_healthy,age){
if (age >= 15):
if (gender == "Male"):
return rateMale= (count(healthy)/count(healthy)+count(un_healthy))
Else
return rateFemale= (count(healthy)/count(healthy)+count(un_healthy))
Else
return print("underage");
and then just use .apply,
but the logic isn't right, and I still don't get my desired output.
I want to return the male rate and the female rate.
You could use pivot_table (where df is your dataframe):
df = df[df.Age >= 15].pivot_table(
index="status", columns="Gender", values="ID",
aggfunc="count", margins=True, fill_value=0
)
Result for your sample dataframe:
Gender Female Male All
status
healthy 1 1 2
un_healthy 0 1 1
All 1 2 3
If you want percentages:
df = (df / df.loc["All", :] * 100).drop("All")
Result:
Gender Female Male All
status
healthy 100.0 50.0 66.666667
un_healthy 0.0 50.0 33.333333
df[col_name].value_counts(normalize=True) gives you the proportions for the desired column. Here's how you can parameterize it:
def health_percentages(df, col_name):
    return df[col_name].value_counts(normalize=True)*100
Example:
data = [ [1, 'healthy',16,'M'], [2, 'un_healthy',14,'F'], [3, 'un_healthy', 22, 'M'],[4, 'healthy', 12, 'F'],[5, 'healthy', 33, 'F']]
df = pd.DataFrame(data, columns = ['ID','status', 'Age', 'Gender'])
print(df)
print(health_percentages(df, 'status'))
#output:
ID status Age Gender
0 1 healthy 16 M
1 2 un_healthy 14 F
2 3 un_healthy 22 M
3 4 healthy 12 F
4 5 healthy 33 F
healthy 60.0
un_healthy 40.0
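To get one rate per gender for people aged 15 or over, which is what the question asks for, a short sketch could look like the following. The function name `health_rate` and the `min_age` parameter are illustrative assumptions; the data is the sample from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "status": ["healthy", "un_healthy", "un_healthy", "healthy", "healthy"],
    "Age": [16, 14, 22, 12, 33],
    "Gender": ["Male", "Female", "Male", "Female", "Female"],
})


def health_rate(frame: pd.DataFrame, min_age: int = 15) -> pd.Series:
    """Percentage of healthy people per gender among those aged >= min_age."""
    adults = frame[frame["Age"] >= min_age]
    # The mean of a boolean Series is the share of True values
    return adults["status"].eq("healthy").groupby(adults["Gender"]).mean() * 100


rates = health_rate(df)
print(rates)
```

The result is a Series indexed by gender, so `rates["Male"]` and `rates["Female"]` give the two values separately.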
I have a Pandas dataframe similar to this one:
age name sex
0 30 jon male
1 blue php null
2 18 jane female
3 orange c++ null
and I am trying to concatenate every second row to the previous one adding extra columns:
age name sex colour language other
0 30 jon male blue php null
1 18 jane female orange c++ null
I tried shift(), but it duplicated every row.
How can this be done?
You can create a new dataframe by slicing the dataframe using iloc with a step of 2:
cols = ['age', 'name', 'sex']
new_cols = ['colour', 'language', 'other']
d = dict()
for col, ncol in zip(cols, new_cols):
    d[col] = df[col].iloc[::2].values
    d[ncol] = df[col].iloc[1::2].values
pd.DataFrame(d)
Result:
age colour name language sex other
0 30 blue jon php male NaN
1 18 orange jane c++ female NaN
TRY:
df = pd.concat([df.iloc[::2].reset_index(drop=True), pd.DataFrame(
    df.iloc[1::2].values, columns=['colour', 'language', 'other'])], axis=1)
OUTPUT:
age name sex colour language other
0 30 jon male blue php NaN
1 18 jane female orange c++ NaN
Reshape the values and create a new dataframe
pd.DataFrame(df.values.reshape(-1, df.shape[1] * 2),
columns=['age', 'name', 'sex', 'colour', 'language', 'other'])
age name sex colour language other
0 30 jon male blue php NaN
1 18 jane female orange c++ NaN
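The answers above all rely on the same idea: split the frame into even and odd rows, then place them side by side. Here is a self-contained sketch of that idea with the steps named; the check for an even number of rows is an added assumption, since the question does not say what should happen with a dangling last row.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [30, "blue", 18, "orange"],
    "name": ["jon", "php", "jane", "c++"],
    "sex": ["male", None, "female", None],
})

# Assumption: rows always come in pairs
if len(df) % 2 != 0:
    raise ValueError("expected an even number of rows")

first = df.iloc[::2].reset_index(drop=True)    # rows 0, 2, ...
second = df.iloc[1::2].reset_index(drop=True)  # rows 1, 3, ...
second.columns = ["colour", "language", "other"]

# Both halves now share the index 0..n/2-1, so they line up column-wise
wide = pd.concat([first, second], axis=1)
print(wide)
```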
Below is my dataframe
import pandas as pd
df = pd.DataFrame({'name': ['jon','sam','jane','bob'],
'age': [30,25,18,26],
'sex':['male','male','female','male']})
age name sex
0 30 jon male
1 25 sam male
2 18 jane female
3 26 bob male
I want to insert a new row at the first position
name: dean, age: 45, sex: male
age name sex
0 45 dean male
1 30 jon male
2 25 sam male
3 18 jane female
4 26 bob male
What is the best way to do this in pandas?
Probably this is not the most efficient way but:
df.loc[-1] = [45, 'Dean', 'male'] # adding a row (values in column order: age, name, sex)
df.index = df.index + 1 # shifting index
df.sort_index(inplace=True)
Output:
age name sex
0 45 Dean male
1 30 jon male
2 25 sam male
3 18 jane female
4 26 bob male
If this is going to be a frequent operation, then it makes sense (in terms of performance) to gather the data into a list first and then use pd.concat([], ignore_index=True) (similar to @Serenity's solution):
Demo:
data = []
# always inserting new rows at the first position - last row will be always on top
data.insert(0, {'name': 'dean', 'age': 45, 'sex': 'male'})
data.insert(0, {'name': 'joe', 'age': 33, 'sex': 'male'})
#...
pd.concat([pd.DataFrame(data), df], ignore_index=True)
In [56]: pd.concat([pd.DataFrame(data), df], ignore_index=True)
Out[56]:
age name sex
0 33 joe male
1 45 dean male
2 30 jon male
3 25 sam male
4 18 jane female
5 26 bob male
PS I wouldn't call .append(), pd.concat(), .sort_index() too frequently (for each single row) as it's pretty expensive. So the idea is to do it in chunks...
@edyvedy13's solution worked great for me. However, it needs to be updated for the deprecation of pandas' sort method, now replaced with sort_index.
df.loc[-1] = [45, 'Dean', 'male'] # adding a row (values in column order: age, name, sex)
df.index = df.index + 1 # shifting index
df = df.sort_index() # sorting by index
Use pandas.concat and reset the index of the new dataframe:
import pandas as pd
df = pd.DataFrame({'name': ['jon','sam','jane','bob'],
'age': [30,25,18,26],
'sex':['male','male','female','male']})
# new line
line = pd.DataFrame({'name': 'dean', 'age': 45, 'sex': 'male'}, index=[0])
# concatenate the two dataframes (.ix was removed in pandas 1.0; df can be passed directly)
df2 = pd.concat([line, df]).reset_index(drop=True)
print (df2)
Output:
age name sex
0 45 dean male
1 30 jon male
2 25 sam male
3 18 jane female
4 26 bob male
import pandas as pd
df = pd.DataFrame({'name': ['jon','sam','jane','bob'],
'age': [30,25,18,26],
'sex': ['male','male','female','male']})
df1 = pd.DataFrame({'name': ['dean'], 'age': [45], 'sex':['male']})
df1 = pd.concat([df1, df])  # DataFrame.append was removed in pandas 2.0
df1 = df1.reset_index(drop=True)
That works
This works for me.
>>> import pandas as pd
>>> df = pd.DataFrame({'name': ['jon','sam','jane','bob'],
... 'age': [30,25,18,26],
... 'sex':['male','male','female','male']})
>>> df
age name sex
0 30 jon male
1 25 sam male
2 18 jane female
3 26 bob male
>>> df.loc['a']=[45,'dean','male']
>>> df
age name sex
0 30 jon male
1 25 sam male
2 18 jane female
3 26 bob male
a 45 dean male
>>> newIndex=['a']+[ind for ind in df.index if ind!='a']
>>> df=df.reindex(index=newIndex)
>>> df
age name sex
a 45 dean male
0 30 jon male
1 25 sam male
2 18 jane female
3 26 bob male
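The approaches above all insert at the top. For an arbitrary position, one possible sketch slices the frame around the target position and concatenates the pieces. The helper `insert_row` and its parameters are hypothetical names, not a pandas API.

```python
import pandas as pd

df = pd.DataFrame({'name': ['jon', 'sam', 'jane', 'bob'],
                   'age': [30, 25, 18, 26],
                   'sex': ['male', 'male', 'female', 'male']})


def insert_row(frame: pd.DataFrame, position: int, row: dict) -> pd.DataFrame:
    """Return a new frame with `row` placed at integer `position`."""
    # Building the row with the frame's columns keeps the column order
    new = pd.DataFrame([row], columns=frame.columns)
    pieces = [frame.iloc[:position], new, frame.iloc[position:]]
    # ignore_index=True renumbers the result 0..n
    return pd.concat(pieces, ignore_index=True)


out = insert_row(df, 0, {'name': 'dean', 'age': 45, 'sex': 'male'})
print(out)
```

Like the other concat-based answers, this copies the whole frame on every call, so for many insertions it is cheaper to collect the rows first and concatenate once.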
I have a list called 'gender', of which I counted all the occurrences of the values with Counter:
gender = ['2',
'Female,',
'All Female Group,',
'All Male Group,',
'Female,',
'Couple,',
'Mixed Group,'....]
gender_count = Counter(gender)
gender_count
Counter({'2': 1,
'All Female Group,': 222,
'All Male Group,': 119,
'Couple,': 256,
'Female,': 1738,
'Male,': 2077,
'Mixed Group,': 212,
'NA': 16})
I want to put this dict into a pandas DataFrame. I have used pd.Series (following "Convert Python dict into a dataframe"):
s = pd.Series(gender_count, name='gender count')
s.index.name = 'gender'
s.reset_index()
Which gives me the dataframe I want, but I don't know how to save these steps into a pandas DataFrame.
I also tried using DataFrame.from_dict()
s2 = pd.DataFrame.from_dict(gender_count, orient='index')
But this creates a dataframe with the categories of gender as the index.
I eventually want to use gender categories and the count for a piechart.
Skip the intermediate step
gender = ['2',
'Female',
'All Female Group',
'All Male Group',
'Female',
'Couple',
'Mixed Group']
pd.Series(gender).value_counts()
Female 2
2 1
Couple 1
Mixed Group 1
All Female Group 1
All Male Group 1
dtype: int64
In [21]: df = pd.Series(gender_count).rename_axis('gender').reset_index(name='count')
In [22]: df
Out[22]:
gender count
0 2 1
1 All Female Group, 222
2 All Male Group, 119
3 Couple, 256
4 Female, 1738
5 Male, 2077
6 Mixed Group, 212
7 NA 16
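Another route from the Counter to a two-column DataFrame is `Counter.most_common()`, which returns (value, count) pairs already sorted by count. The sketch below also strips the trailing commas seen in the question's labels; that cleaning step is an assumption about what the asker wants, and the short list stands in for the full data.

```python
from collections import Counter

import pandas as pd

gender = ['2', 'Female,', 'All Female Group,', 'All Male Group,',
          'Female,', 'Couple,', 'Mixed Group,']

# Assumption: the trailing commas are scraping residue and can be dropped
cleaned = [g.rstrip(',') for g in gender]

# most_common() yields (label, count) tuples, largest count first
counts = pd.DataFrame(Counter(cleaned).most_common(), columns=['gender', 'count'])
print(counts)
```

For the pie chart mentioned in the question, something like `counts.set_index('gender')['count'].plot.pie()` should work, provided matplotlib is installed.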
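What about just:
s = pd.DataFrame(list(gender_count.items()), columns=['gender', 'count'])  # pd.DataFrame(gender_count) alone raises ValueError for scalar values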